Term paper on Data mining
How to use Weka for data analysis
Submitted by: Shubham Gupta (10BM60085)

Vinod Gupta School of Management
The first technique we apply in Weka is classification. The data below describe the financial
situation in Japan and were collected for the years 1970-2009. The columns represent:

    1)   BROAD: Broad money supplied in the economy
    2)   DOMC: Domestic consumption
    3)   PSC: Payment securities
    4)   CLAIMS: Represents the claims on the government.
    5)   TOTRES: Total Reserves
    6)   GDP: Gross domestic product
    7)   LIQLB: Liquid Liability

We want a decision tree that tells us which values of the independent variables lead to which
outcome. For example, if we knew that DOMC > 140 and PSC > 150.3 always gave a GDP above, say,
3 trillion yen, that rule would help us make better decisions. We therefore perform this analysis
to generate a decision tree that yields such rules.


  YEAR       BROAD     CLAIMS     DOMC       PSC         TOTRES        LIQLB           GDP
  1970       83.65      61.88     134.25    111.75     4876114550     104.73      205,995,000,000
  1971       106.70     21.37     147.59    123.72    15469150615     118.21      232,681,000,000
  1972       116.14     23.17     160.29    133.47    18932675966     129.03      308,137,000,000
  1973       116.02     19.84     157.87    132.20    13723930639     126.07      418,640,000,000
  1974       113.08     13.72     154.00    126.49    16551248298     120.50      464,705,000,000
  1975       118.31     13.02     164.40    129.96    14910849997     127.56      505,317,000,000
  1976       122.40     12.09     169.96    130.63    18590784646     131.20      567,926,000,000
  1977       125.82      8.76     172.45    128.49    25907710023     133.90      698,968,000,000
  1978       130.36      8.56     178.29    127.71    37824744320     139.12      982,078,000,000
  1979       135.51      8.19     183.05    129.23    31926244737     142.67    1,022,190,000,000
  1980       137.95      8.09     188.44    131.29    38918848626     144.30    1,071,000,000,000
  1981       142.13      8.04     194.09    134.10    37839039769     150.03    1,183,790,000,000
  1982       149.54      7.67     203.99    139.59    34403732201     156.18    1,100,410,000,000
  1983       156.55      6.72     213.12    145.03    33844549531     162.92    1,200,190,000,000
  1984       159.31      6.69     217.77    147.43    33898638541     165.34    1,275,560,000,000
  1985       160.68      7.66     220.09    149.90    34641202378     167.41    1,364,160,000,000
  1986       167.30      7.67     230.23    156.30    51727320082     174.65    2,020,890,000,000
  1987       175.85     12.27     243.85    173.48    92701641597     183.77    2,448,670,000,000
  1988       178.70     10.66     251.68    182.52    1.06668E+11     186.47    2,971,030,000,000
  1989       182.62     10.13     258.13    190.28    93672771034     192.14    2,972,670,000,000
  1990       184.06      8.46     259.15    194.81    87828362969     190.16    3,058,040,000,000
  1991       184.35      5.20     257.54    195.40    80625855126     189.32    3,484,770,000,000
  1992       187.89      4.16     265.33    199.63    79696644593     190.93    3,796,110,000,000
  1993       193.97      1.33     274.00    202.14    1.07989E+11     198.16    4,350,010,000,000
  1994       200.35      1.88     281.02    204.58    1.35146E+11     204.45    4,778,990,000,000
  1995       205.79      1.26     287.13    203.90     1.9262E+11     209.90    5,264,380,000,000
  1996       209.72      1.81     292.42    205.21    2.25594E+11     213.63    4,642,540,000,000
  1997       215.31      6.47     276.47    217.76    2.26679E+11     221.38    4,261,840,000,000
  1998       229.64      1.80     298.40    228.01    2.22443E+11     233.17    3,857,030,000,000
  1999       239.91     -1.20     309.92    231.08    2.93948E+11     243.22    4,368,730,000,000
  2000       242.24     -1.58     308.91    222.28    3.61639E+11     243.84    4,667,450,000,000
  2001       225.31    -33.25     299.43    193.01    4.01958E+11     187.41    4,095,480,000,000
  2002       207.79     -4.32     299.16    182.40    4.69618E+11     190.79    3,918,340,000,000
  2003       209.70     -1.99     307.26    180.71    6.73554E+11     191.84    4,229,100,000,000
  2004       207.51     -1.10     303.48    174.12    8.44667E+11     189.79    4,605,920,000,000
  2005       207.24      1.79     312.85    182.87    8.46896E+11     189.30    4,552,200,000,000
  2006       204.73     -0.14     304.96    179.99    8.95321E+11     186.06    4,362,590,000,000
  2007       201.50      0.16     294.31    172.56    9.73297E+11     184.17    4,377,940,000,000
  2008       207.14      0.76     295.42    165.48    1.03076E+12     189.52    4,879,860,000,000
  2009       223.76     -1.12     320.53    171.00    1.04899E+12     206.13    5,032,980,000,000


Loading data in Weka is quite easy. Just click on the open file option and give the location of the file.




Figure 1 Shows how to load data in Weka
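
Weka's Explorer opens ARFF files (and CSV files) through the open file option. As a rough sketch of preparing the table for loading, the Python snippet below writes a few of its rows as ARFF text. The function name to_arff and the relation name japan_finance are our own illustrative choices, not part of Weka, and only the first three rows of the table are included. The thousands separators in the GDP column are dropped, since a comma is the field delimiter in ARFF data lines.

```python
def to_arff(relation, attributes, rows):
    """Return ARFF text for a table whose attributes are all numeric."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in attributes]
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match attribute count")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


ATTRIBUTES = ["YEAR", "BROAD", "CLAIMS", "DOMC", "PSC", "TOTRES", "LIQLB", "GDP"]

# First three rows of the table above; GDP is written without thousands separators.
ROWS = [
    (1970, 83.65, 61.88, 134.25, 111.75, 4876114550, 104.73, 205995000000),
    (1971, 106.70, 21.37, 147.59, 123.72, 15469150615, 118.21, 232681000000),
    (1972, 116.14, 23.17, 160.29, 133.47, 18932675966, 129.03, 308137000000),
]

arff_text = to_arff("japan_finance", ATTRIBUTES, ROWS)
print(arff_text)
```

Saving arff_text to a file with the .arff extension should give something the open file dialog can read; the remaining rows would be added in the same way.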

Weka is used to classify the above data in order to find out how these economic factors should be
adjusted or held fixed so as to achieve 11% growth over the previous year’s GDP.
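
To make the 11% target concrete, the following Python sketch computes year-on-year GDP growth and flags the years that reach the target. It uses the 1985-1989 GDP figures from the table; the function names growth_rates and meets_target are hypothetical helpers of ours, and the code assumes the years supplied are consecutive.

```python
# GDP values for 1985-1989, copied from the table above.
GDP = {
    1985: 1_364_160_000_000,
    1986: 2_020_890_000_000,
    1987: 2_448_670_000_000,
    1988: 2_971_030_000_000,
    1989: 2_972_670_000_000,
}


def growth_rates(gdp_by_year):
    """Map each year (except the first) to its growth over the previous year."""
    years = sorted(gdp_by_year)
    return {
        year: gdp_by_year[year] / gdp_by_year[prev] - 1.0
        for prev, year in zip(years, years[1:])
    }


def meets_target(gdp_by_year, target=0.11):
    """Flag the years whose growth is at least the target (11% by default)."""
    return {year: rate >= target for year, rate in growth_rates(gdp_by_year).items()}


for year, rate in growth_rates(GDP).items():
    print(year, f"{rate:.1%}", "target met" if rate >= 0.11 else "below target")
```

A flag of this kind could serve as a class label if the goal is to learn which attribute values accompany years that reach the target.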
Figure 2 Shows where to find the tree techniques in Weka

The following is the output obtained by running the above data through Weka. The classifier used to
create the required decision tree is M5P. Weka's M5P algorithm is a rational reconstruction of M5 with
some enhancements. It builds on M5Base, which implements the base routines for generating M5 model
trees and rules. The original M5 algorithm was invented by R. Quinlan, and Yong Wang made
improvements. M5P (where the P stands for ‘prime’) generates M5 model trees using the M5' algorithm,
which was introduced in Wang & Witten (1997) and enhances the original M5 algorithm by Quinlan (1992).
The output of the analysis is shown below:

=== Run information ===

Scheme: weka.classifiers.trees.M5P -M 4.0

Relation:    Copy of Data_Rudra-weka.filters.unsupervised.attribute.Remove-R1

Instances: 945

Attributes: 6

        BROAD, CLAIMS, DOMC, PSC, TOTRES, LIQLB

Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===

M5 pruned model tree:

(Using smoothed linear models)

BROAD <= 153.045 : LM1 (13/5.644%)

BROAD > 153.045 :

| PSC <= 203.02 :

| | BROAD <= 177.275 : LM2 (5/0.653%)

| | BROAD > 177.275 :

| | | TOTRES <= 871108500000 : LM3 (11/8.309%)

| | | TOTRES > 871108500000 : LM4 (4/1.446%)

| PSC > 203.02 : LM5 (7/2.741%)



LM num: 1

LIQLB = 0.7447 * BROAD + 0.1474 * PSC - 0 * TOTRES + 22.3168

LM num: 2

LIQLB = 0.586 * BROAD + 0.2788 * PSC - 0 * TOTRES + 30.3097

LM num: 3

LIQLB = 0.4606 * BROAD + 0.2504 * PSC - 0 * TOTRES + 58.87

LM num: 4

LIQLB = 0.4996 * BROAD + 0.2504 * PSC - 0 * TOTRES + 50.7563

LM num: 5

LIQLB = 0.7016 * BROAD + 0.2497 * PSC - 0 * TOTRES + 15.2517

Number of Rules: 5



Time taken to build model: 0.08 seconds
=== Cross-validation ===

=== Summary ===



Correlation coefficient          0.9882

Mean absolute error              3.412

Root mean squared error            5.4145

Relative absolute error           11.529 %

Root relative squared error       15.1993 %

Total Number of Instances          40

Ignored Class Unknown Instances     905
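
The summary figures above can be read with the help of their textbook definitions. The Python sketch below computes the same five measures for a pair of actual and predicted lists. It is only an illustration: the function name regression_metrics and the four example values are ours (not taken from the Weka run), and Weka may compute the baseline for the two relative errors slightly differently, for instance from the training folds.

```python
import math


def regression_metrics(actual, predicted):
    """Textbook versions of the measures in a Weka regression summary."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Baseline: always predicting the mean of the actual values.
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "correlation": cov / math.sqrt(base_sq * var_p),
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100.0 * abs_err / base_abs,
        "rrse_percent": 100.0 * math.sqrt(sq_err / base_sq),
    }


# Made-up illustrative values, not output from the Weka run.
example = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
for name, value in example.items():
    print(f"{name:>14}: {value:.4f}")
```

Read this way, a relative absolute error of 11.529 % means the model's total absolute error is roughly a ninth of what a constant mean prediction would give.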




Interpretation of the Results:

Based on the data above, the M5 algorithm generates a model tree with five linear models (LM) at its
leaves. The first split is on broad money in the economy: if BROAD is less than or equal to 153.045, we
follow linear model 1 (LM1) to estimate liquid liabilities (LIQLB). If BROAD > 153.045, we check PSC and
move down the tree, choosing the corresponding model at each branch to obtain LIQLB, as shown in the
tree above. Note that the class attribute in this run is LIQLB; GDP is not among the six attributes
listed, so the tree does not predict GDP directly.
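
The tree and its five linear models can be followed by hand, or with a few lines of Python as sketched below. The thresholds and coefficients are copied from the M5P output above; the function name m5p_predict is our own. The TOTRES coefficient is printed as 0 in every model and is treated as exactly zero here, so the results are approximations of what Weka itself would predict.

```python
# (BROAD coefficient, PSC coefficient, intercept) for each leaf, from the output above.
LINEAR_MODELS = {
    "LM1": (0.7447, 0.1474, 22.3168),
    "LM2": (0.5860, 0.2788, 30.3097),
    "LM3": (0.4606, 0.2504, 58.8700),
    "LM4": (0.4996, 0.2504, 50.7563),
    "LM5": (0.7016, 0.2497, 15.2517),
}


def m5p_predict(broad, psc, totres):
    """Walk the pruned model tree and return (leaf name, estimated LIQLB)."""
    if broad <= 153.045:
        leaf = "LM1"
    elif psc > 203.02:
        leaf = "LM5"
    elif broad <= 177.275:
        leaf = "LM2"
    elif totres <= 871108500000:
        leaf = "LM3"
    else:
        leaf = "LM4"
    w_broad, w_psc, intercept = LINEAR_MODELS[leaf]
    return leaf, w_broad * broad + w_psc * psc + intercept


# Rows for 1970, 1995 and 2009 from the data table: (BROAD, PSC, TOTRES, actual LIQLB).
for year, broad, psc, totres, actual in [
    (1970, 83.65, 111.75, 4876114550, 104.73),
    (1995, 205.79, 203.90, 1.9262e11, 209.90),
    (2009, 223.76, 171.00, 1.04899e12, 206.13),
]:
    leaf, estimate = m5p_predict(broad, psc, totres)
    print(year, leaf, round(estimate, 2), "actual:", actual)
```

The three rows were picked so that they fall into different leaves (LM1, LM5 and LM4 respectively).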



Linear Regression with Weka
The second technique is linear regression, run in Weka on the same data. When the outcome, or class,
is numeric and all the attributes are numeric, linear regression is a natural technique to consider.
The idea is to express the class as a linear combination of the attributes, with weights estimated from
the training data. The previous technique fitted five linear models to different regions of the same
data, whereas linear regression fits a single equation to all of it; in the runs reported here the
single equation has the higher cross-validated error (root mean squared error 8.0404, against 5.4145
for M5P). From the same data we can thus also find a linear regression equation relating the various
parameters. To run the regression, go to the Classify tab in Weka and choose LinearRegression from the
functions group, as shown.




Figure 3 Shows where to find LR in Weka

The following output is generated by the above analysis:

=== Run information ===

Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8

Relation:    Copy of Data_Rudra

Instances: 945

Attributes: 7

        YEAR

        BROAD

        CLAIMS

        DOMC

        PSC

        TOTRES

        LIQLB

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Linear Regression Model

LIQLB = 1.2523 * BROAD + 0.6062 * CLAIMS + -0.1407 * DOMC + 0 * TOTRES + -6.9705

Time taken to build model: 0.2 seconds

=== Cross-validation ===

=== Summary ===



Correlation coefficient              0.9738

Mean absolute error                  4.8731

Root mean squared error               8.0404

Relative absolute error              16.4661 %

Root relative squared error          22.5707 %

Total Number of Instances               40

Ignored Class Unknown Instances         905

The above analysis gives us a linear mathematical relationship between the variables. The value of
the dependent variable (LIQLB) can be computed once the values of the independent variables are known.
The equation also tells us how the variables are related: a negative coefficient indicates an inverse
relationship (the dependent variable falls as that attribute rises), and a positive coefficient a direct
one. To see the same relationships in pictorial form, simply go to the Visualize tab in the Weka
Explorer. This is shown in the figure below.
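
As a small worked example of using the regression equation, the Python sketch below evaluates it for the 1990 row of the data table and shows the effect of the negative DOMC coefficient. The coefficients are copied from the Weka output above; the function name predict_liqlb is our own, and the TOTRES term is left out because its coefficient is printed as 0.

```python
# Coefficients from the Linear Regression Model output above.
COEFFICIENTS = {"BROAD": 1.2523, "CLAIMS": 0.6062, "DOMC": -0.1407}
INTERCEPT = -6.9705


def predict_liqlb(broad, claims, domc):
    """Evaluate the fitted equation; the TOTRES term (coefficient 0) is omitted."""
    return (
        COEFFICIENTS["BROAD"] * broad
        + COEFFICIENTS["CLAIMS"] * claims
        + COEFFICIENTS["DOMC"] * domc
        + INTERCEPT
    )


# 1990 row of the data table: BROAD 184.06, CLAIMS 8.46, DOMC 259.15, LIQLB 190.16.
estimate = predict_liqlb(184.06, 8.46, 259.15)
print("estimated LIQLB:", round(estimate, 2), "actual LIQLB: 190.16")

# Raising DOMC by one unit lowers the estimate, because its coefficient is negative.
change = predict_liqlb(184.06, 8.46, 260.15) - estimate
print("change for +1 DOMC:", round(change, 4))
```

The second print illustrates the inverse relationship: with the other attributes held fixed, each extra unit of DOMC changes the estimate by the DOMC coefficient.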
CLUSTERING IN WEKA
Clustering is a technique used to group similar instances, or rows, in terms of Euclidean distance. We
have used the SimpleKMeans clustering algorithm to analyze clusters in our initial data. SimpleKMeans
clusters data using k-means, with the number of clusters set by a parameter (two in the run below, via
-N 2). Weka's EM clusterer, by contrast, can also choose the number of clusters itself using
cross-validation, in which case the number of folds is fixed at 10. The output of SimpleKMeans for the
above data is shown below. The result is shown as a table whose rows are attribute names and whose
columns correspond to cluster centroids; an additional column at the beginning shows the entire data
set. The number of instances in each cluster appears in parentheses at the top of its column. Each
table entry is either the mean or the mode of the corresponding attribute for the cluster in that
column. The bottom of the output shows the result of applying the learned cluster model. In this case,
it assigned each training instance to one of the clusters, showing the same result as the parenthetical
numbers at the top of each column. An alternative is to use a separate test set or a percentage split
of the training data, in which case the figures would be different. This technique could also be used
with data from other countries in addition to the present data, which is taken for Japan.
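
The idea behind k-means can be sketched in a few lines of Python: assign every instance to its nearest centroid by Euclidean distance, recompute each centroid as the mean of its members, and repeat until the assignments stop changing. The code below is a simplified illustration, not Weka's implementation. The function name kmeans is ours, the starting centroids are passed in explicitly (Weka seeds them randomly using -S), and only the BROAD and LIQLB values of six years from the table are used. Unlike the Weka run, the sketch does not replace missing values and does not rescale the attributes.

```python
import math


def kmeans(points, centroids, max_iter=500):
    """Plain k-means: returns (cluster label per point, final centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# (BROAD, LIQLB) for 1970, 1975, 1980, 2000, 1999 and 1998, from the data table.
points = [
    (83.65, 104.73), (118.31, 127.56), (137.95, 144.30),
    (242.24, 243.84), (239.91, 243.22), (229.64, 233.17),
]

# Start from the 1970 and 2000 rows; this choice is arbitrary.
labels, centroids = kmeans(points, [points[0], points[3]])
print(labels)
print(centroids)
```

With two clusters, the three early years and the three late years end up in separate groups, which mirrors how the centroids in Weka's table summarize each cluster by its attribute means.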
=== Run information ===

Scheme:weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10

Relation:    Copy of Data_Rudra

Instances: 945

Attributes: 7

       YEAR

       BROAD

       CLAIMS

       DOMC

       PSC

       TOTRES

       LIQLB

Test mode:evaluate on training data
=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 5

Within cluster sum of squared errors: 12.988387913678944

Missing values globally replaced with mean/mode



Cluster centroids:

                           Cluster#

Attribute      Full Data              0          1

                (945)           (929)          (16)

=================================================================

YEAR            1989.5         1989.2933        2001.5

BROAD          174.1633        173.4625      214.8525

CLAIMS          6.6645         6.8103         -1.7981

DOMC           242.2808        241.2956       299.4794

PSC            168.2627         167.8077       194.685

TOTRES      248907476505.9463 243675387834.3592           552695625000

LIQLB          175.2342        174.7166       205.2875

Time taken to build model (full training data) : 0.14 seconds

=== Model and evaluation on training set ===

Clustered    Instances

0           929 (98%)

1           16 (2%)

We can also visualize the clusters formed. Right-click the entry in the result list and select
Visualize cluster assignments. We get the following output:
DATA MINING WITH WEKA

  • 1. Term paper on Data mining How to use Weka for data analysis Submitted by: Shubham Gupta (10BM60085) Vinod Gupta School of Management
  • 2. The first technique we apply in Weka is classification. The data below describes the financial situation in Japan and was collected for the years 1970-2009. The columns represent:

        1)   BROAD: Broad money supplied in the economy
        2)   DOMC: Domestic consumption
        3)   PSC: Payment securities
        4)   CLAIMS: Claims on the government
        5)   TOTRES: Total reserves
        6)   GDP: Gross domestic product
        7)   LIQLB: Liquid liabilities

    We want a decision tree that shows which values of the independent variables lead to which outcome. For example, if we knew that DOMC > 140 and PSC > 150.3 always gave a GDP greater than 3 trillion yen, that rule would help us make better decisions. We perform this analysis to generate a decision tree that yields such rules.

      YEAR    BROAD    CLAIMS    DOMC      PSC       TOTRES         LIQLB     GDP
      1970     83.65    61.88    134.25    111.75    4876114550     104.73      205,995,000,000
      1971    106.70    21.37    147.59    123.72    15469150615    118.21      232,681,000,000
      1972    116.14    23.17    160.29    133.47    18932675966    129.03      308,137,000,000
      1973    116.02    19.84    157.87    132.20    13723930639    126.07      418,640,000,000
      1974    113.08    13.72    154.00    126.49    16551248298    120.50      464,705,000,000
      1975    118.31    13.02    164.40    129.96    14910849997    127.56      505,317,000,000
      1976    122.40    12.09    169.96    130.63    18590784646    131.20      567,926,000,000
      1977    125.82     8.76    172.45    128.49    25907710023    133.90      698,968,000,000
      1978    130.36     8.56    178.29    127.71    37824744320    139.12      982,078,000,000
      1979    135.51     8.19    183.05    129.23    31926244737    142.67    1,022,190,000,000
      1980    137.95     8.09    188.44    131.29    38918848626    144.30    1,071,000,000,000
      1981    142.13     8.04    194.09    134.10    37839039769    150.03    1,183,790,000,000
      1982    149.54     7.67    203.99    139.59    34403732201    156.18    1,100,410,000,000
      1983    156.55     6.72    213.12    145.03    33844549531    162.92    1,200,190,000,000
      1984    159.31     6.69    217.77    147.43    33898638541    165.34    1,275,560,000,000
      1985    160.68     7.66    220.09    149.90    34641202378    167.41    1,364,160,000,000
      1986    167.30     7.67    230.23    156.30    51727320082    174.65    2,020,890,000,000
      1987    175.85    12.27    243.85    173.48    92701641597    183.77    2,448,670,000,000
      1988    178.70    10.66    251.68    182.52    1.06668E+11    186.47    2,971,030,000,000
      1989    182.62    10.13    258.13    190.28    93672771034    192.14    2,972,670,000,000
      1990    184.06     8.46    259.15    194.81    87828362969    190.16    3,058,040,000,000
      1991    184.35     5.20    257.54    195.40    80625855126    189.32    3,484,770,000,000
      1992    187.89     4.16    265.33    199.63    79696644593    190.93    3,796,110,000,000
      1993    193.97     1.33    274.00    202.14    1.07989E+11    198.16    4,350,010,000,000
      1994    200.35     1.88    281.02    204.58    1.35146E+11    204.45    4,778,990,000,000
  • 3.
      1995    205.79     1.26    287.13    203.90    1.9262E+11     209.90    5,264,380,000,000
      1996    209.72     1.81    292.42    205.21    2.25594E+11    213.63    4,642,540,000,000
      1997    215.31     6.47    276.47    217.76    2.26679E+11    221.38    4,261,840,000,000
      1998    229.64     1.80    298.40    228.01    2.22443E+11    233.17    3,857,030,000,000
      1999    239.91    -1.20    309.92    231.08    2.93948E+11    243.22    4,368,730,000,000
      2000    242.24    -1.58    308.91    222.28    3.61639E+11    243.84    4,667,450,000,000
      2001    225.31   -33.25    299.43    193.01    4.01958E+11    187.41    4,095,480,000,000
      2002    207.79    -4.32    299.16    182.40    4.69618E+11    190.79    3,918,340,000,000
      2003    209.70    -1.99    307.26    180.71    6.73554E+11    191.84    4,229,100,000,000
      2004    207.51    -1.10    303.48    174.12    8.44667E+11    189.79    4,605,920,000,000
      2005    207.24     1.79    312.85    182.87    8.46896E+11    189.30    4,552,200,000,000
      2006    204.73    -0.14    304.96    179.99    8.95321E+11    186.06    4,362,590,000,000
      2007    201.50     0.16    294.31    172.56    9.73297E+11    184.17    4,377,940,000,000
      2008    207.14     0.76    295.42    165.48    1.03076E+12    189.52    4,879,860,000,000
      2009    223.76    -1.12    320.53    171.00    1.04899E+12    206.13    5,032,980,000,000

    Loading data in Weka is easy: click the Open file option and give the location of the file.

    Figure 1: Loading data in Weka

    Weka is used to classify the above data to find out how these economic factors should be modified or fixed to achieve 11% growth over the previous year's GDP.
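The 11% growth target mentioned above is a year-on-year comparison, which can be checked directly against the GDP column. The following is a minimal sketch, not part of the original analysis: the function names are illustrative, and only the first six years of the table are copied in to keep it short.

```python
# GDP figures copied from the table above (first six years only).
gdp_by_year = {
    1970: 205_995_000_000,
    1971: 232_681_000_000,
    1972: 308_137_000_000,
    1973: 418_640_000_000,
    1974: 464_705_000_000,
    1975: 505_317_000_000,
}


def growth_pct(previous, current):
    """Percentage growth of `current` over `previous`."""
    return (current - previous) / previous * 100.0


def years_meeting_target(series, target_pct=11.0):
    """Years whose GDP grew by at least `target_pct` over the prior year."""
    years = sorted(series)
    return [
        year
        for prev, year in zip(years, years[1:])
        if growth_pct(series[prev], series[year]) >= target_pct
    ]


if __name__ == "__main__":
    for year in years_meeting_target(gdp_by_year):
        prev = year - 1
        print(year, round(growth_pct(gdp_by_year[prev], gdp_by_year[year]), 2))
```

Extending `gdp_by_year` with the remaining rows of the table would label every year as meeting or missing the target.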
  • 4. Figure 2: Where to select the tree technique used

    The following shows the output from running the above data in Weka. The classifier used to create the required decision tree is M5P. Weka's M5P algorithm is a rational reconstruction of M5 with some enhancements. M5Base implements the base routines for generating M5 model trees and rules. The original M5 algorithm was invented by R. Quinlan, and Yong Wang made improvements. M5P (where the P stands for 'prime') generates M5 model trees using the M5' algorithm, which was introduced in Wang & Witten (1997) and enhances the original M5 algorithm by Quinlan (1992). The output of the analysis is shown below:

    === Run information ===

    Scheme:       weka.classifiers.trees.M5P -M 4.0
    Relation:     Copy of Data_Rudra-weka.filters.unsupervised.attribute.Remove-R1
    Instances:    945
    Attributes:   6
                  BROAD
                  CLAIMS
                  DOMC
                  PSC
                  TOTRES
                  LIQLB
    Test mode:    10-fold cross-validation
  • 5. === Classifier model (full training set) ===

    M5 pruned model tree:
    (Using smoothed linear models)

    BROAD <= 153.045 : LM1 (13/5.644%)
    BROAD >  153.045 :
    |   PSC <= 203.02 :
    |   |   BROAD <= 177.275 : LM2 (5/0.653%)
    |   |   BROAD >  177.275 :
    |   |   |   TOTRES <= 871108500000 : LM3 (11/8.309%)
    |   |   |   TOTRES >  871108500000 : LM4 (4/1.446%)
    |   PSC >  203.02 : LM5 (7/2.741%)

    LM num: 1
    LIQLB = 0.7447 * BROAD + 0.1474 * PSC - 0 * TOTRES + 22.3168

    LM num: 2
    LIQLB = 0.586 * BROAD + 0.2788 * PSC - 0 * TOTRES + 30.3097

    LM num: 3
    LIQLB = 0.4606 * BROAD + 0.2504 * PSC - 0 * TOTRES + 58.87

    LM num: 4
    LIQLB = 0.4996 * BROAD + 0.2504 * PSC - 0 * TOTRES + 50.7563

    LM num: 5
    LIQLB = 0.7016 * BROAD + 0.2497 * PSC - 0 * TOTRES + 15.2517

    Number of Rules: 5

    Time taken to build model: 0.08 seconds
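To make the model tree concrete, its splits and leaf equations can be written out as a small function. This is only an illustrative sketch using the rounded coefficients printed in the output; the function name is hypothetical, and the TOTRES term is left out because its coefficient is printed as 0.

```python
# (BROAD coefficient, PSC coefficient, intercept) for each leaf model,
# copied from the M5P output. The TOTRES coefficient is printed as 0
# (rounded), so it is omitted in this sketch.
LEAF_MODELS = {
    "LM1": (0.7447, 0.1474, 22.3168),
    "LM2": (0.586, 0.2788, 30.3097),
    "LM3": (0.4606, 0.2504, 58.87),
    "LM4": (0.4996, 0.2504, 50.7563),
    "LM5": (0.7016, 0.2497, 15.2517),
}


def choose_leaf(broad, psc, totres):
    """Walk the splits of the pruned model tree and return the leaf name."""
    if broad <= 153.045:
        return "LM1"
    if psc > 203.02:
        return "LM5"
    if broad <= 177.275:
        return "LM2"
    if totres <= 871108500000:
        return "LM3"
    return "LM4"


def predict_liqlb(broad, psc, totres):
    """Return (leaf name, estimated LIQLB) for one set of attribute values."""
    leaf = choose_leaf(broad, psc, totres)
    w_broad, w_psc, intercept = LEAF_MODELS[leaf]
    return leaf, w_broad * broad + w_psc * psc + intercept


if __name__ == "__main__":
    # 1970 row from the data table: BROAD 83.65, PSC 111.75, TOTRES 4876114550
    print(predict_liqlb(83.65, 111.75, 4876114550))
```

For the 1970 row the function selects LM1 and estimates LIQLB at about 101.08, against the recorded value of 104.73.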
  • 6. === Cross-validation ===
    === Summary ===

    Correlation coefficient                  0.9882
    Mean absolute error                      3.412
    Root mean squared error                  5.4145
    Relative absolute error                 11.529  %
    Root relative squared error             15.1993 %
    Total Number of Instances               40
    Ignored Class Unknown Instances        905

    Interpretation of the results: Based on the data above, the M5 algorithm generates a model tree formed by 5 linear models (LM). The first split is on the value of broad money in the economy: if BROAD is less than or equal to 153.045, we
  • 7. follow linear model 1 (LM1) to estimate liquidity in the economy. If BROAD > 153.045, we check PSC and move down the tree, choosing the corresponding model to get the liquidity and finally the GDP values, as shown in the figure above.

    Linear Regression with Weka

    The second technique is linear regression in Weka on the same data. When the outcome, or class, is numeric and all the attributes are numeric, linear regression is a natural technique to consider. In the previous technique M5P fitted five linear models to different regions of the same data, whereas linear regression fits a single one; on this data the cross-validated performance of the single linear model is slightly worse than that of M5P. The idea is to express the class as a linear combination of the attributes, with weights estimated from the training data. From the previous data, we can also find a linear regression equation between the various parameters determining GDP.

    To run the regression, go to the Classify tab in Weka and choose LinearRegression from functions, as shown.

    Figure 3: Where to find linear regression in Weka

    The following output is generated by the above analysis:

    === Run information ===

    Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
    Relation:     Copy of Data_Rudra
    Instances:    945
    Attributes:   7
                  YEAR
                  BROAD
  • 8.            CLAIMS
                  DOMC
                  PSC
                  TOTRES
                  LIQLB
    Test mode:    10-fold cross-validation

    === Classifier model (full training set) ===

    Linear Regression Model

    LIQLB =

          1.2523 * BROAD +
          0.6062 * CLAIMS +
         -0.1407 * DOMC +
          0      * TOTRES +
         -6.9705

    Time taken to build model: 0.2 seconds

    === Cross-validation ===
    === Summary ===

    Correlation coefficient                  0.9738
    Mean absolute error                      4.8731
    Root mean squared error                  8.0404
    Relative absolute error                 16.4661 %
    Root relative squared error             22.5707 %
    Total Number of Instances               40
    Ignored Class Unknown Instances        905

    The above analysis gives us a linear mathematical relationship between the variables. The value of the dependent variable (LIQLB) can be computed once the values of the independent variables are known. The equation also shows how the variables are related: a negative coefficient indicates an inverse relationship with the dependent variable, and a positive coefficient a direct one. To see the same relationships in pictorial form, go to the Visualize tab in the Weka Explorer. The same is shown in the figure below.
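As a rough illustration of how the regression equation and the error measures in the summary fit together, the sketch below applies the printed equation to a few rows of the data table and computes the mean absolute error and root mean squared error on those rows. The function names are illustrative, the coefficients are the rounded values from the output, and the TOTRES term is dropped because its coefficient is printed as 0. These are errors on a handful of training rows, so they are not expected to match the 10-fold cross-validation figures reported by Weka.

```python
import math


def lr_predict(broad, claims, domc):
    """Apply the printed linear regression equation for LIQLB."""
    return 1.2523 * broad + 0.6062 * claims - 0.1407 * domc - 6.9705


def mean_absolute_error(errors):
    """Average size of the errors, ignoring their sign."""
    return sum(abs(e) for e in errors) / len(errors)


def root_mean_squared_error(errors):
    """Square root of the average squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# (YEAR, BROAD, CLAIMS, DOMC, actual LIQLB) copied from the data table.
rows = [
    (1970, 83.65, 61.88, 134.25, 104.73),
    (1980, 137.95, 8.09, 188.44, 144.30),
    (1990, 184.06, 8.46, 259.15, 190.16),
    (2000, 242.24, -1.58, 308.91, 243.84),
]

if __name__ == "__main__":
    errors = []
    for year, broad, claims, domc, actual in rows:
        predicted = lr_predict(broad, claims, domc)
        errors.append(predicted - actual)
        print(year, round(predicted, 2), actual)
    print("MAE ", round(mean_absolute_error(errors), 3))
    print("RMSE", round(root_mean_squared_error(errors), 3))
```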
  • 9. CLUSTERING IN WEKA

    Clustering is a technique used to group similar instances (rows), here in terms of Euclidean distance. We have used the SimpleKMeans clustering algorithm to analyze clusters in our initial data. The SimpleKMeans implementation clusters data using k-means, with the number of clusters specified by a parameter. (With Weka's EM clusterer, by contrast, the number of clusters can be specified or the algorithm can decide using cross-validation, in which case the number of folds is fixed at 10.)

    The figure below shows the output of SimpleKMeans for the above data. The result is shown as a table whose rows are attribute names and whose columns correspond to cluster centroids; an additional column at the beginning shows the entire data set. The number of instances in each cluster appears in parentheses at the top of its column. Each table entry is either the mean or the mode of the corresponding attribute for the cluster in that column. The bottom of the output shows the result of applying the learned cluster model. In this case, it assigned each training instance to one of the clusters, showing the same result as the parenthetical numbers at the top of each column. An alternative is to use a separate test set or a percentage split of the training data, in which case the figures would be different. This technique could also be used with data from other countries in addition to the present data, which is for Japan.
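The k-means procedure described above can be sketched in a few lines. This is a simplified illustration rather than Weka's implementation: Weka's SimpleKMeans differs in details such as how initial centroids are chosen and how attributes are scaled, whereas this sketch takes the starting centroids as an argument and uses raw (BROAD, LIQLB) values from six rows of the data table.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute every centroid as the mean of its members, until the
    assignments stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(max_iter):
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(point, centroids[k]))
            for point in points
        ]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return centroids, assignments


# (BROAD, LIQLB) for 1970-1972 and 2007-2009, copied from the data table.
points = [
    (83.65, 104.73), (106.70, 118.21), (116.14, 129.03),
    (201.50, 184.17), (207.14, 189.52), (223.76, 206.13),
]

if __name__ == "__main__":
    # Starting centroids (an arbitrary choice here): first and last point.
    centroids, assignments = kmeans(points, [points[0], points[-1]])
    print(assignments)
    print(centroids)
```

With these starting centroids the early years and the late years fall into separate clusters, and each centroid is the mean of its three rows.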
  • 10. === Run information ===

    Scheme:       weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
    Relation:     Copy of Data_Rudra
    Instances:    945
    Attributes:   7
                  YEAR
                  BROAD
                  CLAIMS
                  DOMC
                  PSC
                  TOTRES
                  LIQLB
    Test mode:    evaluate on training data
  • 11. === Model and evaluation on training set ===

    kMeans
    ======

    Number of iterations: 5
    Within cluster sum of squared errors: 12.988387913678944
    Missing values globally replaced with mean/mode

    Cluster centroids:
                                       Cluster#
    Attribute           Full Data                  0                  1
                            (945)              (929)               (16)
    ===================================================================
    YEAR                   1989.5          1989.2933             2001.5
    BROAD                174.1633           173.4625           214.8525
    CLAIMS                 6.6645             6.8103            -1.7981
    DOMC                 242.2808           241.2956           299.4794
    PSC                  168.2627           167.8077            194.685
    TOTRES      248907476505.9463  243675387834.3592       552695625000
    LIQLB                175.2342           174.7166           205.2875

    Time taken to build model (full training data) : 0.14 seconds

    === Model and evaluation on training set ===

    Clustered Instances

    0      929 ( 98%)
    1       16 (  2%)

    We can also visualize the clusters formed. Right-click the entry in the result list and select Visualize cluster assignments. We get the following output: