SlideShare a Scribd company logo
1 of 100
DATAWAREHOUSETOOLS
Cloudera
Teradata
Oracle
TabLeau
OPENSOURCEDATAMININGTOOLS
WEKA
Orange
KNIME
R-Programming
DATAWAREHOUSING ANDDATA MINING LAB
INDEX
S.No Nameofthe Experiment PgNo Date Signature
1
DataProcessingTechniques:
(i) DataCleaning
(ii) DataTransformation-Normalization
(iii) DataIntegration
1
2
DataWarehouseSchemas:Star,Snowflake,Fact
Constellation
12
3
DataCubeConstruction-OLAPoperations
21
4
DataExtraction,Transformations,Loadingoperations
31
5
ImplementationofApriorialgorithm
45
6
ImplementationofFP-Growth algorithm
54
7
ImplementationofDecisionTree Induction
62
8
Calculatinginformationgainmeasures
69
9
ClassificationofdatausingBayesianapproach
77
10
ClassificationofdatausingK-NearestNeighbor approach
85
11
ImplementationofK-Meansalgorithm
92
InformationTechnology Page1
Experiment1:Performdatapreprocessingtasks Preprocess
Tab
1. Loading Data
The first fourbuttonsatthetopofthepreprocesssectionenable youtoloaddatainto WEKA:
1. Openfile........Bringsup adialogboxallowingyou tobrowseforthedatafileonthelocalfile
system.
2. OpenURL.......Asksfor aUniformResourceLocatoraddressforwherethedataisstored.
3. OpenDB........Readsdatafromadatabase.(Notethat to makethisworkyoumighthavetoedit
thefileinweka/experiment/DatabaseUtils.props.)
4. Generate.......Enablesyouto generateartificialdatafroma varietyofData Generators.Using
theOpenfile......buttonyoucanreadfilesinavarietyofformats:WEKA’sARFFformat, CSV format,
C4.5format,orserializedInstancesformat.ARFFfilestypicallyhavea.arffextension,CSV filesa.csv extension,
C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.
Current Relation: Once some data has been loaded, the Preprocess panel shows a variety of
information.TheCurrentrelationbox(the―current relation‖isthe currentlyloadeddata, which
can be interpreted as a single relationaltable in database terminology) has three entries:
1. Relation.Thenameoftherelation,asgiveninthe fileitwasloadedfrom. Filters(described
InformationTechnology Page2
below)modifythenameofarelation.
2. Instances.Thenumberofinstances(datapoints/records)inthedata.
3. Attributes. Thenumber ofattributes(features)inthedata.
WorkingwithAttributes
BelowtheCurrentrelationboxisaboxtitled Attributes.Therearefourbuttons,and beneath
them is a list of the attributes in the current relation.
The list hasthreecolumns:
1. No. Anumberthat identifiestheattributeintheordertheyarespecifiedinthe datafile.
2. Selectiontickboxes. Theseallowyouselectwhichattributesarepresentintherelation.
3. Name. The name ofthe attribute, as it was declared in the data file. When you click
ondifferent rowsinthe list ofattributes,thefieldschange intheboxtotherighttitledSelected
attribute.
Thisboxdisplaysthecharacteristicsofthecurrentlyhighlightedattributeinthelist:
1. Name.Thenameoftheattribute,thesameasthat given intheattributelist.
2. Type. Thetypeofattribute,mostcommonlyNominalorNumeric.
3. Missing.Thenumber(andpercentage)ofinstancesinthedataforwhichthisattribute is
missing (unspecified).
4. Distinct.Thenumberofdifferentvaluesthatthedatacontains forthisattribute.
5. Unique.Thenumber(andpercentage)ofinstancesinthedatahavingavalue forthisattribute that no
other instances have.
Below these statistics is a list showing more information about the values stored in this
attribute, which differ depending on its type. If the attribute is nominal, the list consists of each
possible value for the attribute along with the number of instances that have that value. If the
attribute is numeric, the list gives four statistics describing the distribution of values in the data—
the minimum, maximum, mean and standard deviation. And below these statistics there is a
coloured histogram, colour-coded according to the attribute chosen as the Class using the
boxabovethe histogram. (This boxwillbringupadrop-downlist ofavailable selectionswhenclicked.)
Note that only nominal Class attributes will result in a color-coding. Finally, after pressing the
Visualize All button, histograms for all the attributes in the data are shown in a separate window.
Returning to the attribute list, to begin with all the tick boxes are unticked. They can be toggled
on/off by clicking on them individually. The four buttons above can also be used to change the
selection.
InformationTechnology Page3
PREPROCESSING
1. All.Allboxesareticked.
2. None.Allboxesarecleared(unticked).
3. Invert.Boxesthataretickedbecomeuntickedandvice versa.
4. Pattern.Enablestheusertoselect attributesbasedonaPerl5RegularExpression.E.g.,.*id selects
all attributes which name ends with id.
Once the desired attributes have been selected, they can be removed by clicking the Remove
buttonbelow the list ofattributes. Notethat this can be undone byclicking the Undo button, which is
located next to the Edit button in the top-right corner of the Preprocess panel.
Working withFilters:-
The preprocess section allows filters to be defined that transform the data in various ways. The
Filter box is used to set up the filters that are required. At the left of the Filter box is a Choose
button. Byclicking this buttonit is possible to selectone ofthe filters inWEKA. Once a filter has
been selected, its name and options are shown in the field next to the Choose button. Clicking on
this box with the left mouse button brings up a GenericObjectEditor dialog box. A click with the
right mouse button (or Alt+Shift+left click) brings up a menu where you can choose, either to
display the properties in a GenericObjectEditor dialog box, or to copy the current setup string to
theclipboard.
InformationTechnology Page4
TheGenericObjectEditorDialogBox
TheGenericObjectEditordialog boxlets youconfigureafilter. Thesamekind
ofdialog box is used to configureotherobjects, suchasclassifiersand clusterers. The
fields in the window reflect the available options.
Right-clicking(orAlt+Shift+Left-Click)onsucha fieldwillbringupapopup menu, listingthe
following options:
1. Show properties... has the same effect as left-clicking on the field, i.e., a dialogappears
allowing you to alter the settings.
2. Copy configuration to clipboard copies the currently displayed configuration string to the
system’s clipboard and therefore can be used anywhere else in WEKA or in the console. This is
rather handy if you have to setup complicated, nested schemes.
3. Enter configuration... is the ―receiving‖ end for configurations that got copied to the
clipboard earlier on. In this dialog you can enter a class name followed by options (if the class
supports these). This also allows you to transfer a filter setting from the Preprocess panel to a
Filtered Classifier used in the Classify panel.
Left-Clicking on anyof these gives an opportunityto alter the filters settings. For example,
the settingmaytake a textstring,in which caseyou type the stringinto the text field provided. Or it
may give a drop-down box listing several states to choose from. Or it may do something else,
depending on the information required. Information on the options is providedin a tool tipif you let
the mouse pointer hover of the corresponding field. More information on the filter and its options
can be obtained by clicking on the More button in the About panel at the top of the
GenericObjectEditor window.
ApplyingFilters
Once you have selected and configured a filter, you can apply it to the data bypressing the
Apply button at the right end of the Filter panel in the Preprocess panel. The Preprocess panelwill
then show the transformed data. The change can be undone by pressing the Undo button. You can
also usetheEdit...buttonto modifyyourdatamanuallyinadataset editor. Finally, theSave...button at the
top right of the Preprocess panel saves the current version of the relation in file formats that can
represent the relation, allowing it to be kept for future use.
Stepsforrunpreprocessing tab in WEKA:
1. OpenWEKATool.
2. Click onWEKAExplorer.
3. ClickonPreprocessingtabbutton.
InformationTechnology Page5
4. Clickonopenfile button.
5. ChooseWEKAfolderinCdrive.
6. Selectand Clickondataoption button.
7. Chooselabordataset and openfile.
8. ChoosefilterbuttonandselecttheUnsupervised-Discritizeoptionandapply
Datasetlabor.arff
Exercise1:
a. Removespacesfromthefileby using Java.
b. Normalizethedatausingmin-maxnormalization
InformationTechnology Page6
RECORDNOTES
InformationTechnology Page7
InformationTechnology Page8
InformationTechnology Page9
InformationTechnology Page10
InformationTechnology Page11
InformationTechnology Page12
Experiment 2: Design multi-dimensional data models namely Star, Snowflake and Fact
Constellation schemas for any one enterprise (ex. Banking, Insurance, Finance, Healthcare,
manufacturing, Automobiles, sales etc).
SchemaDefinition
Multidimensional schema is defined using Data Mining Query Language (DMQL). The two
primitives, cube definitionand dimensiondefinition, can be used for defining the data warehouses
and data marts.
StarSchema
 Eachdimension inastarschemaisrepresented withonlyone-dimensiontable.
 Thisdimensiontablecontainsthesetofattributes.
 Thefollowingdiagramshowsthesalesdataofacompanywithrespecttothefour
dimensions, namely time, item, branch, and location.
 Thereisafacttableatthecenter.Itcontainsthekeystoeachoffourdimensions.
 Thefacttablealsocontainstheattributes,namelydollarssoldandunitssold.
InformationTechnology Page13
SnowflakeSchema
 SomedimensiontablesintheSnowflakeschemaarenormalized.
 Thenormalizationsplitsupthedataintoadditionaltables.
 UnlikeStarschema,thedimensionstableinasnowflakeschemais normalized. For
example,theitemdimensiontableinstarschemaisnormalizedandsplitintotwo dimension tables, namely
item and supplier table.
 Nowtheitemdimensiontablecontainstheattributesitem_key, item_name,type,brand, and
supplier-key.
 Thesupplierkeyislinkedtothesupplierdimensiontable.Thesupplierdimensiontable
contains the attributes supplier_key and supplier_type.
InformationTechnology Page14
FactConstellationSchema
 Afactconstellationhasmultiple facttables.Itisalsoknownasgalaxyschema.
 Thefollowingdiagramshowstwofacttables,namelysalesandshipping.
 Thesalesfacttableissameasthatinthestar schema.
 Theshippingfacttablehasthefivedimensions,namelyitem_key,time_key,shipper_key,
from_location, to_location.
 Theshippingfacttablealsocontainstwomeasures,namelydollarssoldandunitssold.
 Itisalsopossibletosharedimensiontablesbetweenfacttables.Forexample,time,item, and
location dimension tables are shared between the sales and shipping fact table.
Exercise2:
DesigndatawarehouseschemasforBankingapplication.
InformationTechnology Page15
RECORDNOTES
InformationTechnology Page16
InformationTechnology Page17
InformationTechnology Page18
InformationTechnology Page19
InformationTechnology Page20
InformationTechnology Page21
Experiment3:PerformVariousOLAPoperationssuch slice,dice,rollup,drillupand pivot.
OLAPOPERATIONS
Online Analytical Processing Server (OLAP) is based on the multidimensional data model. It
allows managers, and analysts to get an insight of the information through fast, consistent, and
interactive access to information. Here is the list of OLAP operations:
 Roll-up
 Drill-down
 Sliceanddice
 Pivot(rotate)
Roll-up
Roll-upperformsaggregationona datacube inanyofthefollowingways:
 Byclimbingup aconcepthierarchyforadimension
 Bydimensionreduction
Thefollowingdiagramillustrateshowroll-up works.
InformationTechnology Page22
 Roll-up isperformedbyclimbingupaconcepthierarchyforthedimensionlocation.
 Initiallytheconcepthierarchywas"street<city<province<country".
 Onrollingup,thedataisaggregatedbyascendingthelocationhierarchyfromthelevel o
fcity to
the level of country.
 Thedataisgroupedinto citiesratherthancountries.
 Whenroll-upisperformed,oneormoredimensionsfromthedatacubeareremoved.
Drill-down
Drill-downisthereverseoperationofroll-up.Itisperformedbyeither ofthefollowing ways:
 Bystepping downaconcepthierarchyforadimension
 Byintroducinganewdimension.
Thefollowingdiagramillustrateshowdrill-downworks:
 Drill-downisperformedbysteppingdownaconcepthierarchyforthedimensiontime.
 Initiallytheconcepthierarchywas"day<month<quarter<year."
 Ondrillingdown,thetimedimensionisdescended fromthe levelofquartertothelevel o
fmonth.
 Whendrill-downisperformed,oneormoredimensionsfromthedatacubeareadded.
 Itnavigatesthedatafromlessdetailed datatohighlydetaileddata.
InformationTechnology Page23
Slice
Theslice operationselectsoneparticular dimensionfromagivencubeandprovidesanewsub- cube.
Consider the following diagram that shows how slice works.
 HereSliceisperformedforthedimension"time"usingthecriteriontime="Q1".
 Itwillformanewsub-cubebyselecting oneormoredimensions.
Dice
Diceselectstwoormoredimensions fromagivencubeandprovidesanewsub-cube. Consider the
following diagram that shows the dice operation.
InformationTechnology Page24
Thediceoperationonthecubebased onthefollowing selectioncriteriainvolvesthreedimensions.
 (location="Toronto"or"Vancouver")
(time = "Q1" or "Q2")
 (item=" Mobile" or"Modem")
Pivot
The pivot operation is also known as rotation. It rotatesthe data axes in view in order to provide
analternativepresentationofdata.Considerthe followingdiagramthat showsthepivotoperation.
Exercise3:
ApplytheOLAPoperationsfortheabovebanking application.
InformationTechnology Page25
RECORDNOTES
InformationTechnology Page26
InformationTechnology Page27
InformationTechnology Page28
InformationTechnology Page29
InformationTechnology Page30
InformationTechnology Page31
Experiment4: ETLscriptsandimplementusingdatawarehousetools.
ETL comes fromData Warehousing and stands for Extract-Transform-Load. ETL covers a process
of how the data are loaded from the source system to the data warehouse. Extraction–
transformation–loading (ETL) tools are pieces of software responsible for the extraction of data
from several sources, its cleansing, customization, reformatting, integration, and insertion into a
data warehouse.
Building the ETL process is potentially one of the biggest tasks of building a warehouse; it is
complex, time consuming, and consumes most ofdata warehouse project‘s implementationefforts,
costs, and resources.
Buildingadatawarehouserequires focusingcloselyonunderstandingthreemainareas:
1. SourceArea-Thesourceareahasstandardmodels suchasentityrelationshipdiagram.
2. DestinationArea-Thedestinationareahasstandard modelssuchasstarschema.
3. MappingArea- Butthemapping areahas notastandardmodeltillnow.
ETLProcess:
Extract
The Extract step covers the data extraction from the source system and makes it accessible for
further processing. The main objective of the extract step is to retrieve all the required data fromthe
source systemwith as little resources as possible. The extract step should be designed in a way that
it does not negatively affect the source system in terms or performance, response time or any kind
of locking.
Thereare severalwaysto performtheextract:
 Update notification- ifthe source systemisable to provide a notificationthat arecord has been
changed and describe the change, this is the easiest wayto get the data.
 Incremental extract - some systems may not be able to provide notification that an update has
occurred, but they are able to identify which records have been modified and provide an extract of
such records. During further ETL steps, the system needs to identify changes and propagate itdown.
Note, that byusing daily extract, we may not be able to handle deleted recordsproperly.
 Full extract - some systems are not able to identify which data has been changed at all, so a ful
extract is the only way one can get the data out of the system. The full extract requires keeping a
copy of the last extract in the same format in order to be able to identify changes. Full extract
handles deletions as well.
InformationTechnology Page32
Transform
The transform step applies a set of rules to transform the data from the source to the target. This
includes converting any measured data to the same dimension(i.e. conformed dimension) using the
same units so that they can later be joined. The transformation step also requires joining data from
several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated
values, and applying advanced validation rules.
Load
During the load step, it is necessaryto ensure that the load is performed correctly and with as little
resources as possible. The target of the Load process is often a database. In order to make the load
processefficient,it ishelpfulto disableanyconstraintsand indexesbeforethe loadandenablethem back
only after the load completes. The referential integrity needs to be maintained by ETL tool to ensure
consistency.
ETLmethod
ETL as scripts that canjust be runonthe database. These scripts must be re-runnable: theyshould
beableto berunwithout modificationtopickupanychangesinthe legacydata,andautomatically work out
how to merge the changes into the new schema.
Inordertomeettherequirements,myscriptsmust:
1.INSERTrowsinthenewtablesbasedonanydatainthesourcethathasn‘t alreadybeen created
in the destination
2.UPDATE rowsinthenewtablesbasedonanydata inthesourcethathasalreadybeeninserted in the
destination
3.DELETErowsinthenewtableswherethesourcedatahasbeendeleted Next
step is to design the architecture for custom ETL solution.
1.createtwonewschemasonthenewdatabase:LEGACYandMIGRATE
2.takeasnapshot ofalldatainthelegacydatabase,andloaditastablesintheLEGACYschema
3.grantread-onlyonalltablesinLEGACY toMIGRATE
4.GrantCRUDonalltablesinthetarget schematoMIGRATE.
InformationTechnology Page33
WEKA
VisualizationFeatures:
WEKA’s visualization allows you to visualize a 2-D plot of the current working relation.
Visualization is very useful in practice, it helps to determine difficulty of the learning problem.
WEKA can visualize single attributes (1-d) and pairs of attributes (2-d), rotate 3-d visualizations
(Xgobi-style). WEKA has “Jitter” option to deal with nominal attributes and to detect “hidden”
data points.
 Access To Visualization From The Classifier, Cluster And Attribute Selection Panel Is
Available From A Popup Menu. Click The Right Mouse Button Over An Entry In TheResult
List To Bring Up The Menu. You Will Be Presented With Options For Viewing Or Saving
The Text Output And --- Depending On The Scheme --- Further Options For Visualizing
Errors, Clusters, Trees Etc.
ToopenVisualizationscreen, click‘Visualize’ tab.
Select a square that corresponds to the attributes you would like to visualize. For example, let’s
choose ‘outlook’ for X – axis and ‘play’ for Y – axis. Click anywhere inside the square that
corresponds to ‘play on the left and ‘outlook’ at the top
InformationTechnology Page34
Changingthe View:
Inthe visualizationwindow, beneaththe X-axis selectorthere is a drop-downlist, ‘Colour’, for choosing
thecolorscheme. Thisallows youto choosethecolorofpointsbased ontheattributeselected. Belowthe plot
area,there isa legendthat describes what valuesthe colors correspond to. In your example, red represents
‘no’, while blue represents ‘yes’. For better visibility you should change the color of label ‘yes’. Left-
click on ‘yes’ inthe ‘Class colour’ box and select lighter color fromthe colorpalette.
Totheright oftheplot areathereareseriesofhorizontalstrips. Eachstriprepresentsan attribute,
and the dots within it show the distribution values of the attribute. You can choose
what axes are used inthe maingraph byclicking onthese strips (left-click changes X-axis, right-
click changes Y-axis).
The software setsX - axis to ‘Outlook’ attribute and Y - axis to ‘Play’. The instances are spread
out in the plot area and concentration points are not visible. Keep sliding ‘Jitter’, a random
displacement given to all points in the plot, to the right, until you can spot concentration points.
The results are shown below. But on this screen we changed ‘Colour’ to temperature. Besides
‘outlook’ and ‘play’, this allows you to see the ‘temperature’ corresponding to the ‘outlook’. It
will affect your result because if you see ‘outlook’ = ‘sunny’ and ‘play’ = ‘no’ to explain the
result, you need to see the ‘temperature’ – if it is too hot, you do not want to play. Change
‘Colour’ to ‘windy’, you can see that if it is windy, you do not want to play as well.
SelectingInstances
Sometimes it is helpful to select a subset of the data using visualization tool. A special
case isthe ‘UserClassifier’, which lets youto build yourownclassifier byinteractivelyselecting
instances. Below the Y – axis there is a drop-down list that allows you to choose a selection
method. A groupofpoints onthe graph can be selected in four ways [2]:
InformationTechnology Page35
1. SelectInstance. Click onan individualdata point.It bringsup a window listing attributes of
the point. If more than one point will appear at the same location, more than one set of
attributes will be shown.
2. Rectangle.Youcancreatearectanglebydraggingitaroundthepoint.
3. Polygon. Youcanselect severalpointsbybuilding afree-formpolygon. Left-clickonthe
graphto add vertices to the polygonand right-click to complete it.
InformationTechnology Page36
4. Polyline.Todistinguishthepointsononeside fromtheonceonanother,youcan builda
polyline. Left-clickonthe graphto addverticesto thepolylineand right-click to finish.
InformationTechnology Page37
B) ExploreWEKADataMining/MachineLearningToolkit.
InstallStepsforWEKA aDataMiningTool
1. Download the software as your requirements from the below given
link.http://www.cs.waikato.ac.nz/ml/weka/downloading.html
2. TheJava ismandatoryforinstallationofWEKAso ifyouhavealreadyJavaonyour
machine then download only WEKA else download the software withJVM.
3. Thenopenthefilelocationand doubleclick onthefile
4. ClickNext
InformationTechnology Page38
5. ClickIAgree.
6. As your requirement dothenecessarychangesofsettingsandclickNext. Fulland
Associate files are the recommended settings.
InformationTechnology Page39
7. Changetoyourdesireinstallationlocation.
8. Ifyou wantashortcutthencheck theboxand clickInstall.
InformationTechnology Page40
9. TheInstallationwillstartwaitforawhileitwillfinishwithinaminute.
10. AftercompleteinstallationclickonNext.
11. ClickontheFinishand startMining.
InformationTechnology Page41
ThisistheGUIyougetwhenstarted.Youhave4optionsExplorer,Experimenter,Knowledge Flow and
Simple CLI.
UnderstandthefeaturesofWEKAtoolkitsuchas Explorer,Knowledgeflowinterface,
Experimenter, command-line interface.
WEKA
Weka is created by researchers at the university WIKATO in New Zealand. University of Waikato,
Hamilton, New Zealand Alex Seewald (original Command-line primer) David Scuse (original
Experimenter tutorial)
 Itisjavabasedapplication.
 Itiscollectionoftensource, MachineLearningAlgorithm.
 Theroutines(functions)are implementedasclassesand logicallyarrangedinpackages.
 ItcomeswithanextensiveGUIInterface.
 Wekaroutinescanbeusedstandaloneviathecommand line interface. The
Graphical User Interface;-
The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point forlaunching
Weka’s main GUI applications and supporting tools. If one prefers a MDI (“multiple document
interface”) appearance, then this is provided by an alternative launcher called “Main” (class
weka.gui.Main). The GUI Chooser consists of four buttons—one for each of the four major Weka
applications—and four menus.
InformationTechnology Page42
Thebuttonscanbeusedtostartthefollowingapplications:
 Explorer An environment for exploring data with WEKA (the rest ofthisDocumentation
deals with this application in more detail).
 Experimenter Anenvironmentforperformingexperimentsandconductingstatistical tests
between learning schemes.
 Knowledge Flow This environment supports essentiallythe same functions as the Explorerbut
with a drag-and-drop interface. One advantage isthat it supports incremental learning.
 SimpleCLI Provides a simple command-line interface that allows direct execution ofWEKA
commands for operating systems that do not provide their own command line interface.
1. Explorer:TheGraphicaluserinterface
Section Tabs
At the very top of the window, just below the title bar, is a row of tabs. When the Explorer is first
started only the first tab is active; the others are grayed out. This is because it is necessary to open
(and potentially pre-process) a data set before starting to explore thedata.
Thetabsareasfollows:
1. Preprocess.Chooseandmodifythedatabeingactedon.
2. Classify. Train&testlearning schemesthatclassifyor performregression
3. Cluster.Learnclustersforthedata.
4. Associate.Learnassociationrulesforthedata.
InformationTechnology Page43
5. Selectattributes.Selectthe mostrelevantattributes inthedata.
6. Visualize.Viewaninteractive2Dplotofthedata.
Once the tabs are active, clicking on them flicks between different screens, on which the
respective actions can be performed. The bottom area of the window (including the status box, the
log button, and the Weka bird) stays visible regardless of which section you arein. The Explorer can
be easily extended with custom tabs.
2.WekaExperimenter:-
The Weka Experiment Environment enables the user to create, run, modify, and analyze
experiments in a more convenient manner than is possible when processing the schemes
individually. For example, the user can create an experiment that runs several schemes against a
series of datasets and then analyze the results to determine if one of the schemes is (statistically)
better than the other schemes.
The Experiment Environment can be run fromthe command line using the Simple CLI. For
example, the following commands could be typed into the CLI to run the OneR scheme on the Iris
datasetusingabasictrainandtest process.(Notethatthecommandswould betypedononeline into the
CLI.) While commands can be typed directly into the CLI, this technique is not particularly
convenient and the experiments are not easy to modify. The Experimenter comes in two flavors’,
either with a simple interface that provides most of the functionality one needs for experiments, or
with an interface with full access to the Experimenter’s capabilities. You can choose between
those two with the Experiment Configuration Mode radio buttons:
 Simple
 Advanced
InformationTechnology Page44
Both setups allow you to setup standard experimentsthat are run locallyon a single machine, or
remote experiments, which are distributed between several hosts. The distribution of experiments
cuts downthe time the experiments willtake until completion, but onthe other hand the setup takes
more time. The next section coversthe standard experiments (both, simple and advanced), followed
by the remote experiments and finally the analyzing of theresults.
InformationTechnology Page45
Experiment5:ImplementationofApriorialgorithm
Associationrule mining isdefined as: Let be a setofnbinaryattributescalled items. Let be a set
oftransactionscalled thedatabase. Eachtransaction in D hasauniquetransactionIDand contains a
subset of the itemsin I. A ruleis defined as an implication of the form X=>Y where X,Y C I
andXΠY=Φ . Thesetsofitems(forshort itemsets)XandYarecalledantecedent (left handside or LHS)
and consequent (right hand side or RHS) of the rule respectively.To illustrate the concepts, we use
a small example from the supermarket domain.
The set of items is I = {milk,bread,butter,beer} and a small database containing the items (1
codes presence and 0 absence of an item in a transaction) is shown in the table to the right. An
example rule for the supermarket could be meaning that if milk and bread is bought, customers
also buy butter.
To select interesting rules from the set of all possible rules, constraints on various measures of
significance and interest can be used. The best known constraints are minimum thresholds on
support and confidence. The support supp(X) of an itemset X is defined as the proportion of
transactions in the data set which contain the itemset. Inthe example database, the itemset {milk,
bread} has a support of 2 / 5 = 0.4 since it occurs in 40% of all transactions (2 out of 5
transactions).
The confidence of a rule is defined. For example, the rule has a confidence of 0.2 / 0.4 = 0.5 in
the database, which means that for 50% of the transactions containing milk and bread the rule is
correct. Confidence can be interpreted as an estimate of the probability P(Y | X), the probabilityof
finding the RHS of the rule in transactions under the condition that these transactions also contain
the LHS.
ALGORITHM:
Association rule mining is to find out association rules that satisfy the predefined minimum
support and confidence from a given database. The problem is usually decomposed into two sub
problems. One is to find those itemsets whose occurrences exceed a predefined threshold in the
database; those itemsets are called frequent or large itemsets. The second problem is to generate
association rules from those large itemsets with the constraints of minimal confidence.
Supposeoneof thelarge itemsets isLk, Lk ={I1,I2, …,Ik},associationruleswiththis itemsets are
generated in the following way: the first rule is {I1, I2, … , Ik1} and {Ik}, by checking the
confidence thisrule can be determined as interesting or not.
Then other rule are generated by deleting the last items in the antecedent and inserting it to the
consequent, further the confidences ofthe new rules are checked to determine the interestingness
of them. Those processes iterated until the antecedent becomes empty. Since the second
subproblem is quite straight forward, most of the researches focus on the first subproblem.
InformationTechnology Page46
TheApriorialgorithmfindsthefrequent sets LInDatabaseD. Find
frequent set Lk − 1.
· JoinStep.
.CkisgeneratedbyjoiningLk−1with itself
· PruneStep.
oAny(k−1)itemsetthat isnot frequent cannot beasubset ofafrequent kitemset,henceshould be
removed.
Where· (Ck:Candidateitemsetofsizek)
· (Lk:frequentitemsetofsizek)AprioriPseudocode
Apriori(T,£)
L<{Large1itemsetsthatappearinmorethantransactions}
whileL(k1)≠Φ C(k)<Generate(Lk−1)fortransactionst€T
C(t)Subset(Ck,t)
forcandidates c € C(t)
count[c]<count[c]+1L(k)<{c
€C(k)|count[c]≥£ K<K+1 returnỤL(k)k.
StepsforrunApriorialgorithminWEKA
o OpenWEKATool.
o Click onWEKAExplorer.
• ClickonPreprocessing tabbutton.
• Clickonopenfile button.
• ChooseWEKAfolderinCdrive.
o Selectand Clickondataoptionbutton.
o ChooseWeather datasetand openfile.
o ClickonAssociatetaband Choose Apriorialgorithm
o Clickonstartbutton.
InformationTechnology Page47
(X Y).count
Y
AssociationRule:
An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an item found in
the data. A consequent is an itemthat is found in combination with the antecedent.
Association rules are created by analyzing data for frequent if/then patterns and using the criteria support and
confidence to identify the most important relationships. Support is an indication of how frequently the items
appear in the database. Confidence indicates the number of times the if/then statements have been found to be
true.
In data mining, association rules are useful for analyzing and predicting customer behavior. They playan
important part in shopping basket data analysis, product clustering, catalog design and store layout.
SupportandConfidencevalues:
 Support count: The support countofan itemset X, denoted byX.count, ina data set T isthe
number of transactions inT that contain X. Assume T has n transactions.
 Then,
support
n
confidence
(X ).count
X.count
support=support({AUC})
confidence =support({AU C})/support({A})
Exercise5:
Apply the Apriori algorithm on Airport noise monitoring data set, Discriminating
betweenpatients with Parkinsons and neurological diseases using voice recordings dataset.
[https://archive.ics.uci.edu/ml/machine-learning-databases/00000/refer this link for datasets]
InformationTechnology Page48
RECORDNOTES
InformationTechnology Page49
InformationTechnology Page50
InformationTechnology Page51
InformationTechnology Page52
InformationTechnology Page53
InformationTechnology Page54
Experiment6:ImplementationofFP-Growthalgorithm
By using the FP-Growth method, the number of scans of the entire database can be reduced to
two. The algorithm extracts frequent item sets that can be used toextract association rules. This is
done using the support of an item set. The main idea of the algorithm is to use a divide and
conquerstrategy:
Compress the database which provides the frequent sets; then divide this compressed database
into a set of conditional databases, each associated with a frequent set and applydata mining on
each database.
To compress the data source, a specialdata structure called the FP-Tree is needed [26]. The tree
is used for the data mining part.
Finally the algorithm works in two steps:
1. ConstructionoftheFP-Tree
2. Extractfrequentitemsets
Construction oftheFP-Tree
The FP-Tree is a compressed representation of the input. While reading the data source each
transaction t is mapped to a path in the FP-Tree. As different transaction can have several items
in common, their path may overlap. With this it is possible to compress the structure.
ThebelowfigureshowsanexampleforthegenerationofanFP-treeusing10transactions.
Extractfrequent item sets
A bottom-up strategy starts with the leaves and moves up to the root using a divide and conquer
strategy. Because every transaction is mapped on a path in the FP-Tree, it is possible to mine
frequent item sets ending in a particular item.
Steps:
o OpenWEKATool.
o Click onWEKAExplorer.
• ClickonPreprocessing tabbutton.
InformationTechnology Page55
• Clickonopenfile button.
• ChooseWEKAfolderinCdrive.
o Selectand Clickondataoptionbutton.
o Chooseadatasetand openfile.
o ClickonAssociatetabandChooseFP-Growthalgorithm
o Clickonstartbutton.
Exercise6:
ApplyFP-GrowthalgorithmonBloodTransfusionServiceCenterdataset
InformationTechnology Page56
RECORDNOTES
InformationTechnology Page57
InformationTechnology Page58
InformationTechnology Page59
InformationTechnology Page60
InformationTechnology Page61
InformationTechnology Page62
Experiment7:ImplementationofDecisionTreeInduction Steps
to model decision tree.
1. Doubleclickoncredit-g.arfffile.
2. Considerallthe21attributes formakingdecisiontree.
3. Clickonclassifytab.
4. Clickonchoose button.
5. ExpandtreefolderandselectJ48
6. Clickonusetrainingsetintestoptions.
7. Clickonstartbutton.
8. Rightclickonresultlistandchoosethevisualizetreetogetdecisiontree.
WecreatedadecisiontreebyusingJ48Technique forthecompletedataset asthetrainingdata. The
following model obtained after training.
Exercise7:
Applydecision treealgorithmto bookatableinahotel/bookatrainticket/movieticket.
InformationTechnology Page63
RECORDNOTES
InformationTechnology Page64
InformationTechnology Page65
InformationTechnology Page66
InformationTechnology Page67
InformationTechnology Page68
InformationTechnology Page69
Experiment8:calculatinginformationgain measures.
Information gain(IG)measures how much “information” a feature gives us about the class. –
Features that perfectly partition should give maximalinformation. – Unrelated features should
give no information. It measuresthe reduction in entropy. CfsSubsetEval aims to identify a
subset of attributes that are highly correlated with the target while not being strongly correlated
with one another. It searches through the space of possible attribute subsets for the “best” one
using the BestFirst search method by default, although other methods can be chosen. To use the
wrapper method rather than a filter method, such as CfsSubsetEval, first select
WrapperSubsetEval and then configure it by choosing a learning algorithm to apply and setting
the number of cross-validation folds to use when evaluating it on each attribute subset.
Steps:
oOpenWEKATool.
oClick onWEKAExplorer.
• ClickonPreprocessing tabbutton.
• Clickonopenfile button.
oSelectand Clickondataoptionbutton.
oChooseadatasetand openfile.
oClickonselectattributetabandChooseattributeevaluator, searchmethodalgorithm
oClickonstartbutton.
InformationTechnology Page70
Exercise8:
Calculatetheinformation gainonweatherdataset(foreach attributes separately).
InformationTechnology Page71
RECORDNOTES
InformationTechnology Page72
InformationTechnology Page73
InformationTechnology Page74
InformationTechnology Page75
InformationTechnology Page76
InformationTechnology Page77
Experiment9:classificationofdatausingBayesianapproach
Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are the statistical
classifiers. Bayesianclassifierscanpredict class membershipprobabilitiessuchastheprobability that
a given tuple belongs to a particular class.
Steps:
1. OpenWEKATool.
2. Click onWEKAExplorer.
3. ClickonPreprocessingtabbutton.
4. Clickonopenfile button.
5. ChooseWEKAfolderinCdrive.
6. Selectand Clickondataoption button.
7. Chooseadatasetand openfile.
8. Clickonclassifytaband ChooseNaïve-bayesalgorithmandselectusetraining settestoption.’
9. Clickonstartbutton.
InformationTechnology Page78
Exercise9
Classifydata(lungcancer/diabetes/liverdisorder)usingBayesianapproach.
InformationTechnology Page79
RECORDNOTES
InformationTechnology Page80
InformationTechnology Page81
InformationTechnology Page82
InformationTechnology Page83
InformationTechnology Page84
InformationTechnology Page85
Experiment10:classification ofdatausingK-nearest neighborapproach
K nearest neighbors is a simple algorithmthat stores all available cases and classifies new
cases based on a similarity measure (e.g., distance functions). K-nearest neighbors (KNN)
algorithm is a type ofsupervised ML algorithm which can be used for both classification as well
as regression predictive problems. However, it is mainly used for classification predictive
problems in industry. The following two properties would define KNN well −
 Lazylearningalgorithm−KNNisalazylearningalgorithmbecauseitdoesnothavea specialized
training phase and uses all the data for training while classification.
 Non-parametric learningalgorithm−KNN isalso anon-parametric learningalgorithmbecause it
doesn’t assume anything about the underlying data.
Steps:
1. OpenWEKATool.
2. ClickonWEKAExplorer.
3. ClickonPreprocessingtabbutton.
4. Clickonopenfilebutton.
5. ChooseWEKAfolderinCdrive.
6. Selectand Click ondataoption button.
7. Chooseadatasetandopenfile.
8. ClickonclassifytabandChoosek-nearestneighbor andselectusetrainingsettestoption.
9. Clickonstartbutton.
Exercise10:
PerformanalysisonirisdatasetandbuildclusterusingK-nearestneighborapproach.
InformationTechnology Page86
RECORDNOTES
InformationTechnology Page87
InformationTechnology Page88
InformationTechnology Page89
InformationTechnology Page90
InformationTechnology Page91
InformationTechnology Page92
Experiment 11:implementation ofK-means algorithm
Kmeansalgorithm is an iterative algorithm that tries to partition the dataset into Kpre-defined
distinctnon-overlappingsubgroups(clusters)whereeachdatapointbelongstoonlyonegroup.It tries to
make the inter-cluster datapoints as similar as possible while also keeping the clusters as
different (far) as possible. It assigns data points to a cluster such that the sum of the squared
distance betweenthe data points and the cluster’scentroid (arithmetic mean ofallthe data points
thatbelongtothatcluster)isattheminimum.Thelessvariationwehavewithinclusters,themore
homogeneous (similar) the data points are within the same cluster.
Steps:
1. OpenWEKATool.
2. ClickonWEKAExplorer.
3. ClickonPreprocessingtabbutton.
4. Clickonopenfilebutton.
5. ChooseWEKAfolderinCdrive.
6. SelectandClickondataoptionbutton.
7. Chooseadatasetandopenfile.
8. ClickonclustertabandChoosek-means algorithm.
9. Clickonstartbutton.
Exercise11:
ImplementofK-meansclusteringusingcrimedataset.
InformationTechnology Page93
RECORDNOTES
InformationTechnology Page94
InformationTechnology Page95
InformationTechnology Page96
InformationTechnology Page97
InformationTechnology Page98

More Related Content

Similar to Data Mining Lab Manual.docx

Create a basic performance point dashboard epc
Create a basic performance point dashboard   epcCreate a basic performance point dashboard   epc
Create a basic performance point dashboard epcEPC Group
 
ConnectSMART Tutorials
ConnectSMART TutorialsConnectSMART Tutorials
ConnectSMART TutorialsConnectSMART
 
Introduction to weka
Introduction to wekaIntroduction to weka
Introduction to wekaJK Knowledge
 
IntoTheNebulaArticle.pdf
IntoTheNebulaArticle.pdfIntoTheNebulaArticle.pdf
IntoTheNebulaArticle.pdfDavid Harrison
 
IntoTheNebulaArticle.pdf
IntoTheNebulaArticle.pdfIntoTheNebulaArticle.pdf
IntoTheNebulaArticle.pdfDavid Harrison
 
Data mining techniques using weka
Data mining techniques using wekaData mining techniques using weka
Data mining techniques using wekarathorenitin87
 
Introduction to STATA(2).pdf
Introduction to STATA(2).pdfIntroduction to STATA(2).pdf
Introduction to STATA(2).pdfYomif3
 
Micro station v8 manual
Micro station v8 manualMicro station v8 manual
Micro station v8 manualyuri30
 
Week 2 Project - STAT 3001Student Name Type your name here.docx
Week 2 Project - STAT 3001Student Name Type your name here.docxWeek 2 Project - STAT 3001Student Name Type your name here.docx
Week 2 Project - STAT 3001Student Name Type your name here.docxcockekeshia
 
Data_Processing_Program
Data_Processing_ProgramData_Processing_Program
Data_Processing_ProgramNeil Dahlqvist
 
4 creating and modifying objects in access
4 creating and modifying objects in access4 creating and modifying objects in access
4 creating and modifying objects in accessCJ Quijano
 
RockWorks 2021 Training Manual (1).pdf
RockWorks 2021 Training Manual (1).pdfRockWorks 2021 Training Manual (1).pdf
RockWorks 2021 Training Manual (1).pdfAbdilbasitHamid
 
Tutorial on how to load images in crystal reports dynamically using visual ba...
Tutorial on how to load images in crystal reports dynamically using visual ba...Tutorial on how to load images in crystal reports dynamically using visual ba...
Tutorial on how to load images in crystal reports dynamically using visual ba...Aeric Poon
 
DesignSpark PCB Workshop Notes 2018
DesignSpark PCB Workshop Notes 2018DesignSpark PCB Workshop Notes 2018
DesignSpark PCB Workshop Notes 2018DesignSparkGC
 
Oracle General ledger ivas
Oracle General ledger ivasOracle General ledger ivas
Oracle General ledger ivasAli Ibrahim
 
Obiee metadata development
Obiee metadata developmentObiee metadata development
Obiee metadata developmentdils4u
 

Similar to Data Mining Lab Manual.docx (20)

Create a basic performance point dashboard epc
Create a basic performance point dashboard   epcCreate a basic performance point dashboard   epc
Create a basic performance point dashboard epc
 
ConnectSMART Tutorials
ConnectSMART TutorialsConnectSMART Tutorials
ConnectSMART Tutorials
 
Introduction to weka
Introduction to wekaIntroduction to weka
Introduction to weka
 
IntoTheNebulaArticle.pdf
IntoTheNebulaArticle.pdfIntoTheNebulaArticle.pdf
IntoTheNebulaArticle.pdf
 
IntoTheNebulaArticle.pdf
IntoTheNebulaArticle.pdfIntoTheNebulaArticle.pdf
IntoTheNebulaArticle.pdf
 
Libre Office Calc Lesson 5: Working with Data
Libre Office Calc Lesson 5: Working with DataLibre Office Calc Lesson 5: Working with Data
Libre Office Calc Lesson 5: Working with Data
 
Data mining techniques using weka
Data mining techniques using wekaData mining techniques using weka
Data mining techniques using weka
 
R12 i setup doc
R12 i setup docR12 i setup doc
R12 i setup doc
 
Arena tutorial
Arena tutorialArena tutorial
Arena tutorial
 
Introduction to STATA(2).pdf
Introduction to STATA(2).pdfIntroduction to STATA(2).pdf
Introduction to STATA(2).pdf
 
Micro station v8 manual
Micro station v8 manualMicro station v8 manual
Micro station v8 manual
 
Week 2 Project - STAT 3001Student Name Type your name here.docx
Week 2 Project - STAT 3001Student Name Type your name here.docxWeek 2 Project - STAT 3001Student Name Type your name here.docx
Week 2 Project - STAT 3001Student Name Type your name here.docx
 
Data_Processing_Program
Data_Processing_ProgramData_Processing_Program
Data_Processing_Program
 
4 creating and modifying objects in access
4 creating and modifying objects in access4 creating and modifying objects in access
4 creating and modifying objects in access
 
RockWorks 2021 Training Manual (1).pdf
RockWorks 2021 Training Manual (1).pdfRockWorks 2021 Training Manual (1).pdf
RockWorks 2021 Training Manual (1).pdf
 
Tutorial on how to load images in crystal reports dynamically using visual ba...
Tutorial on how to load images in crystal reports dynamically using visual ba...Tutorial on how to load images in crystal reports dynamically using visual ba...
Tutorial on how to load images in crystal reports dynamically using visual ba...
 
DesignSpark PCB Workshop Notes 2018
DesignSpark PCB Workshop Notes 2018DesignSpark PCB Workshop Notes 2018
DesignSpark PCB Workshop Notes 2018
 
Oracle General ledger ivas
Oracle General ledger ivasOracle General ledger ivas
Oracle General ledger ivas
 
Obiee metadata development
Obiee metadata developmentObiee metadata development
Obiee metadata development
 
Acutate erd pro
Acutate erd proAcutate erd pro
Acutate erd pro
 

Recently uploaded

the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 

Recently uploaded (20)

the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 

Data Mining Lab Manual.docx

  • 2. DATAWAREHOUSING ANDDATA MINING LAB INDEX S.No Nameofthe Experiment PgNo Date Signature 1 DataProcessingTechniques: (i) DataCleaning (ii) DataTransformation-Normalization (iii) DataIntegration 1 2 DataWarehouseSchemas:Star,Snowflake,Fact Constellation 12 3 DataCubeConstruction-OLAPoperations 21 4 DataExtraction,Transformations,Loadingoperations 31 5 ImplementationofApriorialgorithm 45 6 ImplementationofFP-Growth algorithm 54 7 ImplementationofDecisionTree Induction 62 8 Calculatinginformationgainmeasures 69 9 ClassificationofdatausingBayesianapproach 77 10 ClassificationofdatausingK-NearestNeighbor approach 85 11 ImplementationofK-Meansalgorithm 92
  • 3. InformationTechnology Page1 Experiment1:Performdatapreprocessingtasks Preprocess Tab 1. Loading Data The first fourbuttonsatthetopofthepreprocesssectionenable youtoloaddatainto WEKA: 1. Openfile........Bringsup adialogboxallowingyou tobrowseforthedatafileonthelocalfile system. 2. OpenURL.......Asksfor aUniformResourceLocatoraddressforwherethedataisstored. 3. OpenDB........Readsdatafromadatabase.(Notethat to makethisworkyoumighthavetoedit thefileinweka/experiment/DatabaseUtils.props.) 4. Generate.......Enablesyouto generateartificialdatafroma varietyofData Generators.Using theOpenfile......buttonyoucanreadfilesinavarietyofformats:WEKA’sARFFformat, CSV format, C4.5format,orserializedInstancesformat.ARFFfilestypicallyhavea.arffextension,CSV filesa.csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension. Current Relation: Once some data has been loaded, the Preprocess panel shows a variety of information.TheCurrentrelationbox(the―current relation‖isthe currentlyloadeddata, which can be interpreted as a single relationaltable in database terminology) has three entries: 1. Relation.Thenameoftherelation,asgiveninthe fileitwasloadedfrom. Filters(described
  • 4. InformationTechnology Page2 below)modifythenameofarelation. 2. Instances.Thenumberofinstances(datapoints/records)inthedata. 3. Attributes. Thenumber ofattributes(features)inthedata. WorkingwithAttributes BelowtheCurrentrelationboxisaboxtitled Attributes.Therearefourbuttons,and beneath them is a list of the attributes in the current relation. The list hasthreecolumns: 1. No. Anumberthat identifiestheattributeintheordertheyarespecifiedinthe datafile. 2. Selectiontickboxes. Theseallowyouselectwhichattributesarepresentintherelation. 3. Name. The name ofthe attribute, as it was declared in the data file. When you click ondifferent rowsinthe list ofattributes,thefieldschange intheboxtotherighttitledSelected attribute. Thisboxdisplaysthecharacteristicsofthecurrentlyhighlightedattributeinthelist: 1. Name.Thenameoftheattribute,thesameasthat given intheattributelist. 2. Type. Thetypeofattribute,mostcommonlyNominalorNumeric. 3. Missing.Thenumber(andpercentage)ofinstancesinthedataforwhichthisattribute is missing (unspecified). 4. Distinct.Thenumberofdifferentvaluesthatthedatacontains forthisattribute. 5. Unique.Thenumber(andpercentage)ofinstancesinthedatahavingavalue forthisattribute that no other instances have. Below these statistics is a list showing more information about the values stored in this attribute, which differ depending on its type. If the attribute is nominal, the list consists of each possible value for the attribute along with the number of instances that have that value. If the attribute is numeric, the list gives four statistics describing the distribution of values in the data— the minimum, maximum, mean and standard deviation. And below these statistics there is a coloured histogram, colour-coded according to the attribute chosen as the Class using the boxabovethe histogram. (This boxwillbringupadrop-downlist ofavailable selectionswhenclicked.) Note that only nominal Class attributes will result in a color-coding. Finally, after pressing the Visualize All button, histograms for all the attributes in the data are shown in a separate window. Returning to the attribute list, to begin with all the tick boxes are unticked. They can be toggled on/off by clicking on them individually. The four buttons above can also be used to change the selection.
  • 5. InformationTechnology Page3 PREPROCESSING 1. All.Allboxesareticked. 2. None.Allboxesarecleared(unticked). 3. Invert.Boxesthataretickedbecomeuntickedandvice versa. 4. Pattern.Enablestheusertoselect attributesbasedonaPerl5RegularExpression.E.g.,.*id selects all attributes which name ends with id. Once the desired attributes have been selected, they can be removed by clicking the Remove buttonbelow the list ofattributes. Notethat this can be undone byclicking the Undo button, which is located next to the Edit button in the top-right corner of the Preprocess panel. Working withFilters:- The preprocess section allows filters to be defined that transform the data in various ways. The Filter box is used to set up the filters that are required. At the left of the Filter box is a Choose button. Byclicking this buttonit is possible to selectone ofthe filters inWEKA. Once a filter has been selected, its name and options are shown in the field next to the Choose button. Clicking on this box with the left mouse button brings up a GenericObjectEditor dialog box. A click with the right mouse button (or Alt+Shift+left click) brings up a menu where you can choose, either to display the properties in a GenericObjectEditor dialog box, or to copy the current setup string to theclipboard.
  • 6. InformationTechnology Page4 TheGenericObjectEditorDialogBox TheGenericObjectEditordialog boxlets youconfigureafilter. Thesamekind ofdialog box is used to configureotherobjects, suchasclassifiersand clusterers. The fields in the window reflect the available options. Right-clicking(orAlt+Shift+Left-Click)onsucha fieldwillbringupapopup menu, listingthe following options: 1. Show properties... has the same effect as left-clicking on the field, i.e., a dialogappears allowing you to alter the settings. 2. Copy configuration to clipboard copies the currently displayed configuration string to the system’s clipboard and therefore can be used anywhere else in WEKA or in the console. This is rather handy if you have to setup complicated, nested schemes. 3. Enter configuration... is the ―receiving‖ end for configurations that got copied to the clipboard earlier on. In this dialog you can enter a class name followed by options (if the class supports these). This also allows you to transfer a filter setting from the Preprocess panel to a Filtered Classifier used in the Classify panel. Left-Clicking on anyof these gives an opportunityto alter the filters settings. For example, the settingmaytake a textstring,in which caseyou type the stringinto the text field provided. Or it may give a drop-down box listing several states to choose from. Or it may do something else, depending on the information required. Information on the options is providedin a tool tipif you let the mouse pointer hover of the corresponding field. More information on the filter and its options can be obtained by clicking on the More button in the About panel at the top of the GenericObjectEditor window. ApplyingFilters Once you have selected and configured a filter, you can apply it to the data bypressing the Apply button at the right end of the Filter panel in the Preprocess panel. The Preprocess panelwill then show the transformed data. The change can be undone by pressing the Undo button. You can also usetheEdit...buttonto modifyyourdatamanuallyinadataset editor. Finally, theSave...button at the top right of the Preprocess panel saves the current version of the relation in file formats that can represent the relation, allowing it to be kept for future use. Stepsforrunpreprocessing tab in WEKA: 1. OpenWEKATool. 2. Click onWEKAExplorer. 3. ClickonPreprocessingtabbutton.
  • 7. InformationTechnology Page5 4. Clickonopenfile button. 5. ChooseWEKAfolderinCdrive. 6. Selectand Clickondataoption button. 7. Chooselabordataset and openfile. 8. ChoosefilterbuttonandselecttheUnsupervised-Discritizeoptionandapply Datasetlabor.arff Exercise1: a. Removespacesfromthefileby using Java. b. Normalizethedatausingmin-maxnormalization
  • 14. InformationTechnology Page12 Experiment 2: Design multi-dimensional data models namely Star, Snowflake and Fact Constellation schemas for any one enterprise (ex. Banking, Insurance, Finance, Healthcare, manufacturing, Automobiles, sales etc). SchemaDefinition Multidimensional schema is defined using Data Mining Query Language (DMQL). The two primitives, cube definitionand dimensiondefinition, can be used for defining the data warehouses and data marts. StarSchema  Eachdimension inastarschemaisrepresented withonlyone-dimensiontable.  Thisdimensiontablecontainsthesetofattributes.  Thefollowingdiagramshowsthesalesdataofacompanywithrespecttothefour dimensions, namely time, item, branch, and location.  Thereisafacttableatthecenter.Itcontainsthekeystoeachoffourdimensions.  Thefacttablealsocontainstheattributes,namelydollarssoldandunitssold.
  • 15. InformationTechnology Page13 SnowflakeSchema  SomedimensiontablesintheSnowflakeschemaarenormalized.  Thenormalizationsplitsupthedataintoadditionaltables.  UnlikeStarschema,thedimensionstableinasnowflakeschemais normalized. For example,theitemdimensiontableinstarschemaisnormalizedandsplitintotwo dimension tables, namely item and supplier table.  Nowtheitemdimensiontablecontainstheattributesitem_key, item_name,type,brand, and supplier-key.  Thesupplierkeyislinkedtothesupplierdimensiontable.Thesupplierdimensiontable contains the attributes supplier_key and supplier_type.
  • 16. InformationTechnology Page14 FactConstellationSchema  Afactconstellationhasmultiple facttables.Itisalsoknownasgalaxyschema.  Thefollowingdiagramshowstwofacttables,namelysalesandshipping.  Thesalesfacttableissameasthatinthestar schema.  Theshippingfacttablehasthefivedimensions,namelyitem_key,time_key,shipper_key, from_location, to_location.  Theshippingfacttablealsocontainstwomeasures,namelydollarssoldandunitssold.  Itisalsopossibletosharedimensiontablesbetweenfacttables.Forexample,time,item, and location dimension tables are shared between the sales and shipping fact table. Exercise2: DesigndatawarehouseschemasforBankingapplication.
  • 23. InformationTechnology Page21 Experiment3:PerformVariousOLAPoperationssuch slice,dice,rollup,drillupand pivot. OLAPOPERATIONS Online Analytical Processing Server (OLAP) is based on the multidimensional data model. It allows managers, and analysts to get an insight of the information through fast, consistent, and interactive access to information. Here is the list of OLAP operations:  Roll-up  Drill-down  Sliceanddice  Pivot(rotate) Roll-up Roll-upperformsaggregationona datacube inanyofthefollowingways:  Byclimbingup aconcepthierarchyforadimension  Bydimensionreduction Thefollowingdiagramillustrateshowroll-up works.
  • 24. InformationTechnology Page22  Roll-up isperformedbyclimbingupaconcepthierarchyforthedimensionlocation.  Initiallytheconcepthierarchywas"street<city<province<country".  Onrollingup,thedataisaggregatedbyascendingthelocationhierarchyfromthelevel o fcity to the level of country.  Thedataisgroupedinto citiesratherthancountries.  Whenroll-upisperformed,oneormoredimensionsfromthedatacubeareremoved. Drill-down Drill-downisthereverseoperationofroll-up.Itisperformedbyeither ofthefollowing ways:  Bystepping downaconcepthierarchyforadimension  Byintroducinganewdimension. Thefollowingdiagramillustrateshowdrill-downworks:  Drill-downisperformedbysteppingdownaconcepthierarchyforthedimensiontime.  Initiallytheconcepthierarchywas"day<month<quarter<year."  Ondrillingdown,thetimedimensionisdescended fromthe levelofquartertothelevel o fmonth.  Whendrill-downisperformed,oneormoredimensionsfromthedatacubeareadded.  Itnavigatesthedatafromlessdetailed datatohighlydetaileddata.
  • 25. InformationTechnology Page23 Slice Theslice operationselectsoneparticular dimensionfromagivencubeandprovidesanewsub- cube. Consider the following diagram that shows how slice works.  HereSliceisperformedforthedimension"time"usingthecriteriontime="Q1".  Itwillformanewsub-cubebyselecting oneormoredimensions. Dice Diceselectstwoormoredimensions fromagivencubeandprovidesanewsub-cube. Consider the following diagram that shows the dice operation.
  • 26. InformationTechnology Page24 Thediceoperationonthecubebased onthefollowing selectioncriteriainvolvesthreedimensions.  (location="Toronto"or"Vancouver") (time = "Q1" or "Q2")  (item=" Mobile" or"Modem") Pivot The pivot operation is also known as rotation. It rotatesthe data axes in view in order to provide analternativepresentationofdata.Considerthe followingdiagramthat showsthepivotoperation. Exercise3: ApplytheOLAPoperationsfortheabovebanking application.
  • 33. InformationTechnology Page31 Experiment4: ETLscriptsandimplementusingdatawarehousetools. ETL comes fromData Warehousing and stands for Extract-Transform-Load. ETL covers a process of how the data are loaded from the source system to the data warehouse. Extraction– transformation–loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, its cleansing, customization, reformatting, integration, and insertion into a data warehouse. Building the ETL process is potentially one of the biggest tasks of building a warehouse; it is complex, time consuming, and consumes most ofdata warehouse project‘s implementationefforts, costs, and resources. Buildingadatawarehouserequires focusingcloselyonunderstandingthreemainareas: 1. SourceArea-Thesourceareahasstandardmodels suchasentityrelationshipdiagram. 2. DestinationArea-Thedestinationareahasstandard modelssuchasstarschema. 3. MappingArea- Butthemapping areahas notastandardmodeltillnow. ETLProcess: Extract The Extract step covers the data extraction from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data fromthe source systemwith as little resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms or performance, response time or any kind of locking. Thereare severalwaysto performtheextract:  Update notification- ifthe source systemisable to provide a notificationthat arecord has been changed and describe the change, this is the easiest wayto get the data.  Incremental extract - some systems may not be able to provide notification that an update has occurred, but they are able to identify which records have been modified and provide an extract of such records. During further ETL steps, the system needs to identify changes and propagate itdown. Note, that byusing daily extract, we may not be able to handle deleted recordsproperly.  Full extract - some systems are not able to identify which data has been changed at all, so a ful extract is the only way one can get the data out of the system. The full extract requires keeping a copy of the last extract in the same format in order to be able to identify changes. Full extract handles deletions as well.
  • 34. InformationTechnology Page32 Transform The transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension(i.e. conformed dimension) using the same units so that they can later be joined. The transformation step also requires joining data from several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules. Load During the load step, it is necessaryto ensure that the load is performed correctly and with as little resources as possible. The target of the Load process is often a database. In order to make the load processefficient,it ishelpfulto disableanyconstraintsand indexesbeforethe loadandenablethem back only after the load completes. The referential integrity needs to be maintained by ETL tool to ensure consistency. ETLmethod ETL as scripts that canjust be runonthe database. These scripts must be re-runnable: theyshould beableto berunwithout modificationtopickupanychangesinthe legacydata,andautomatically work out how to merge the changes into the new schema. Inordertomeettherequirements,myscriptsmust: 1.INSERTrowsinthenewtablesbasedonanydatainthesourcethathasn‘t alreadybeen created in the destination 2.UPDATE rowsinthenewtablesbasedonanydata inthesourcethathasalreadybeeninserted in the destination 3.DELETErowsinthenewtableswherethesourcedatahasbeendeleted Next step is to design the architecture for custom ETL solution. 1.createtwonewschemasonthenewdatabase:LEGACYandMIGRATE 2.takeasnapshot ofalldatainthelegacydatabase,andloaditastablesintheLEGACYschema 3.grantread-onlyonalltablesinLEGACY toMIGRATE 4.GrantCRUDonalltablesinthetarget schematoMIGRATE.
  • 35. InformationTechnology Page33 WEKA VisualizationFeatures: WEKA’s visualization allows you to visualize a 2-D plot of the current working relation. Visualization is very useful in practice, it helps to determine difficulty of the learning problem. WEKA can visualize single attributes (1-d) and pairs of attributes (2-d), rotate 3-d visualizations (Xgobi-style). WEKA has “Jitter” option to deal with nominal attributes and to detect “hidden” data points.  Access To Visualization From The Classifier, Cluster And Attribute Selection Panel Is Available From A Popup Menu. Click The Right Mouse Button Over An Entry In TheResult List To Bring Up The Menu. You Will Be Presented With Options For Viewing Or Saving The Text Output And --- Depending On The Scheme --- Further Options For Visualizing Errors, Clusters, Trees Etc. ToopenVisualizationscreen, click‘Visualize’ tab. Select a square that corresponds to the attributes you would like to visualize. For example, let’s choose ‘outlook’ for X – axis and ‘play’ for Y – axis. Click anywhere inside the square that corresponds to ‘play on the left and ‘outlook’ at the top
  • 36. InformationTechnology Page34 Changingthe View: Inthe visualizationwindow, beneaththe X-axis selectorthere is a drop-downlist, ‘Colour’, for choosing thecolorscheme. Thisallows youto choosethecolorofpointsbased ontheattributeselected. Belowthe plot area,there isa legendthat describes what valuesthe colors correspond to. In your example, red represents ‘no’, while blue represents ‘yes’. For better visibility you should change the color of label ‘yes’. Left- click on ‘yes’ inthe ‘Class colour’ box and select lighter color fromthe colorpalette. Totheright oftheplot areathereareseriesofhorizontalstrips. Eachstriprepresentsan attribute, and the dots within it show the distribution values of the attribute. You can choose what axes are used inthe maingraph byclicking onthese strips (left-click changes X-axis, right- click changes Y-axis). The software setsX - axis to ‘Outlook’ attribute and Y - axis to ‘Play’. The instances are spread out in the plot area and concentration points are not visible. Keep sliding ‘Jitter’, a random displacement given to all points in the plot, to the right, until you can spot concentration points. The results are shown below. But on this screen we changed ‘Colour’ to temperature. Besides ‘outlook’ and ‘play’, this allows you to see the ‘temperature’ corresponding to the ‘outlook’. It will affect your result because if you see ‘outlook’ = ‘sunny’ and ‘play’ = ‘no’ to explain the result, you need to see the ‘temperature’ – if it is too hot, you do not want to play. Change ‘Colour’ to ‘windy’, you can see that if it is windy, you do not want to play as well. SelectingInstances Sometimes it is helpful to select a subset of the data using visualization tool. A special case isthe ‘UserClassifier’, which lets youto build yourownclassifier byinteractivelyselecting instances. Below the Y – axis there is a drop-down list that allows you to choose a selection method. A groupofpoints onthe graph can be selected in four ways [2]:
  • 37. InformationTechnology Page35 1. SelectInstance. Click onan individualdata point.It bringsup a window listing attributes of the point. If more than one point will appear at the same location, more than one set of attributes will be shown. 2. Rectangle.Youcancreatearectanglebydraggingitaroundthepoint. 3. Polygon. Youcanselect severalpointsbybuilding afree-formpolygon. Left-clickonthe graphto add vertices to the polygonand right-click to complete it.
  • 38. InformationTechnology Page36 4. Polyline.Todistinguishthepointsononeside fromtheonceonanother,youcan builda polyline. Left-clickonthe graphto addverticesto thepolylineand right-click to finish.
  • 39. InformationTechnology Page37 B) ExploreWEKADataMining/MachineLearningToolkit. InstallStepsforWEKA aDataMiningTool 1. Download the software as your requirements from the below given link.http://www.cs.waikato.ac.nz/ml/weka/downloading.html 2. TheJava ismandatoryforinstallationofWEKAso ifyouhavealreadyJavaonyour machine then download only WEKA else download the software withJVM. 3. Thenopenthefilelocationand doubleclick onthefile 4. ClickNext
  • 40. InformationTechnology Page38 5. ClickIAgree. 6. As your requirement dothenecessarychangesofsettingsandclickNext. Fulland Associate files are the recommended settings.
  • 41. InformationTechnology Page39 7. Changetoyourdesireinstallationlocation. 8. Ifyou wantashortcutthencheck theboxand clickInstall.
  • 42. InformationTechnology Page40 9. TheInstallationwillstartwaitforawhileitwillfinishwithinaminute. 10. AftercompleteinstallationclickonNext. 11. ClickontheFinishand startMining.
  • 43. InformationTechnology Page41 ThisistheGUIyougetwhenstarted.Youhave4optionsExplorer,Experimenter,Knowledge Flow and Simple CLI. UnderstandthefeaturesofWEKAtoolkitsuchas Explorer,Knowledgeflowinterface, Experimenter, command-line interface. WEKA Weka is created by researchers at the university WIKATO in New Zealand. University of Waikato, Hamilton, New Zealand Alex Seewald (original Command-line primer) David Scuse (original Experimenter tutorial)  Itisjavabasedapplication.  Itiscollectionoftensource, MachineLearningAlgorithm.  Theroutines(functions)are implementedasclassesand logicallyarrangedinpackages.  ItcomeswithanextensiveGUIInterface.  Wekaroutinescanbeusedstandaloneviathecommand line interface. The Graphical User Interface;- The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point forlaunching Weka’s main GUI applications and supporting tools. If one prefers a MDI (“multiple document interface”) appearance, then this is provided by an alternative launcher called “Main” (class weka.gui.Main). The GUI Chooser consists of four buttons—one for each of the four major Weka applications—and four menus.
  • 44. InformationTechnology Page42 Thebuttonscanbeusedtostartthefollowingapplications:  Explorer An environment for exploring data with WEKA (the rest ofthisDocumentation deals with this application in more detail).  Experimenter Anenvironmentforperformingexperimentsandconductingstatistical tests between learning schemes.  Knowledge Flow This environment supports essentiallythe same functions as the Explorerbut with a drag-and-drop interface. One advantage isthat it supports incremental learning.  SimpleCLI Provides a simple command-line interface that allows direct execution ofWEKA commands for operating systems that do not provide their own command line interface. 1. Explorer:TheGraphicaluserinterface Section Tabs At the very top of the window, just below the title bar, is a row of tabs. When the Explorer is first started only the first tab is active; the others are grayed out. This is because it is necessary to open (and potentially pre-process) a data set before starting to explore thedata. Thetabsareasfollows: 1. Preprocess.Chooseandmodifythedatabeingactedon. 2. Classify. Train&testlearning schemesthatclassifyor performregression 3. Cluster.Learnclustersforthedata. 4. Associate.Learnassociationrulesforthedata.
  • 45. InformationTechnology Page43 5. Selectattributes.Selectthe mostrelevantattributes inthedata. 6. Visualize.Viewaninteractive2Dplotofthedata. Once the tabs are active, clicking on them flicks between different screens, on which the respective actions can be performed. The bottom area of the window (including the status box, the log button, and the Weka bird) stays visible regardless of which section you arein. The Explorer can be easily extended with custom tabs. 2.WekaExperimenter:- The Weka Experiment Environment enables the user to create, run, modify, and analyze experiments in a more convenient manner than is possible when processing the schemes individually. For example, the user can create an experiment that runs several schemes against a series of datasets and then analyze the results to determine if one of the schemes is (statistically) better than the other schemes. The Experiment Environment can be run fromthe command line using the Simple CLI. For example, the following commands could be typed into the CLI to run the OneR scheme on the Iris datasetusingabasictrainandtest process.(Notethatthecommandswould betypedononeline into the CLI.) While commands can be typed directly into the CLI, this technique is not particularly convenient and the experiments are not easy to modify. The Experimenter comes in two flavors’, either with a simple interface that provides most of the functionality one needs for experiments, or with an interface with full access to the Experimenter’s capabilities. You can choose between those two with the Experiment Configuration Mode radio buttons:  Simple  Advanced
  • 46. InformationTechnology Page44 Both setups allow you to setup standard experimentsthat are run locallyon a single machine, or remote experiments, which are distributed between several hosts. The distribution of experiments cuts downthe time the experiments willtake until completion, but onthe other hand the setup takes more time. The next section coversthe standard experiments (both, simple and advanced), followed by the remote experiments and finally the analyzing of theresults.
  • 47. InformationTechnology Page45 Experiment5:ImplementationofApriorialgorithm Associationrule mining isdefined as: Let be a setofnbinaryattributescalled items. Let be a set oftransactionscalled thedatabase. Eachtransaction in D hasauniquetransactionIDand contains a subset of the itemsin I. A ruleis defined as an implication of the form X=>Y where X,Y C I andXΠY=Φ . Thesetsofitems(forshort itemsets)XandYarecalledantecedent (left handside or LHS) and consequent (right hand side or RHS) of the rule respectively.To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk,bread,butter,beer} and a small database containing the items (1 codes presence and 0 absence of an item in a transaction) is shown in the table to the right. An example rule for the supermarket could be meaning that if milk and bread is bought, customers also buy butter. To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best known constraints are minimum thresholds on support and confidence. The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. Inthe example database, the itemset {milk, bread} has a support of 2 / 5 = 0.4 since it occurs in 40% of all transactions (2 out of 5 transactions). The confidence of a rule is defined. For example, the rule has a confidence of 0.2 / 0.4 = 0.5 in the database, which means that for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the probability P(Y | X), the probabilityof finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS. ALGORITHM: Association rule mining is to find out association rules that satisfy the predefined minimum support and confidence from a given database. The problem is usually decomposed into two sub problems. One is to find those itemsets whose occurrences exceed a predefined threshold in the database; those itemsets are called frequent or large itemsets. The second problem is to generate association rules from those large itemsets with the constraints of minimal confidence. Supposeoneof thelarge itemsets isLk, Lk ={I1,I2, …,Ik},associationruleswiththis itemsets are generated in the following way: the first rule is {I1, I2, … , Ik1} and {Ik}, by checking the confidence thisrule can be determined as interesting or not. Then other rule are generated by deleting the last items in the antecedent and inserting it to the consequent, further the confidences ofthe new rules are checked to determine the interestingness of them. Those processes iterated until the antecedent becomes empty. Since the second subproblem is quite straight forward, most of the researches focus on the first subproblem.
  • 48. InformationTechnology Page46 TheApriorialgorithmfindsthefrequent sets LInDatabaseD. Find frequent set Lk − 1. · JoinStep. .CkisgeneratedbyjoiningLk−1with itself · PruneStep. oAny(k−1)itemsetthat isnot frequent cannot beasubset ofafrequent kitemset,henceshould be removed. Where· (Ck:Candidateitemsetofsizek) · (Lk:frequentitemsetofsizek)AprioriPseudocode Apriori(T,£) L<{Large1itemsetsthatappearinmorethantransactions} whileL(k1)≠Φ C(k)<Generate(Lk−1)fortransactionst€T C(t)Subset(Ck,t) forcandidates c € C(t) count[c]<count[c]+1L(k)<{c €C(k)|count[c]≥£ K<K+1 returnỤL(k)k. StepsforrunApriorialgorithminWEKA o OpenWEKATool. o Click onWEKAExplorer. • ClickonPreprocessing tabbutton. • Clickonopenfile button. • ChooseWEKAfolderinCdrive. o Selectand Clickondataoptionbutton. o ChooseWeather datasetand openfile. o ClickonAssociatetaband Choose Apriorialgorithm o Clickonstartbutton.
  • 49. InformationTechnology Page47 (X Y).count Y AssociationRule: An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an item found in the data. A consequent is an itemthat is found in combination with the antecedent. Association rules are created by analyzing data for frequent if/then patterns and using the criteria support and confidence to identify the most important relationships. Support is an indication of how frequently the items appear in the database. Confidence indicates the number of times the if/then statements have been found to be true. In data mining, association rules are useful for analyzing and predicting customer behavior. They playan important part in shopping basket data analysis, product clustering, catalog design and store layout. SupportandConfidencevalues:  Support count: The support countofan itemset X, denoted byX.count, ina data set T isthe number of transactions inT that contain X. Assume T has n transactions.  Then, support n confidence (X ).count X.count support=support({AUC}) confidence =support({AU C})/support({A}) Exercise5: Apply the Apriori algorithm on Airport noise monitoring data set, Discriminating betweenpatients with Parkinsons and neurological diseases using voice recordings dataset. [https://archive.ics.uci.edu/ml/machine-learning-databases/00000/refer this link for datasets]
  • 56. InformationTechnology Page54 Experiment6:ImplementationofFP-Growthalgorithm By using the FP-Growth method, the number of scans of the entire database can be reduced to two. The algorithm extracts frequent item sets that can be used toextract association rules. This is done using the support of an item set. The main idea of the algorithm is to use a divide and conquerstrategy: Compress the database which provides the frequent sets; then divide this compressed database into a set of conditional databases, each associated with a frequent set and applydata mining on each database. To compress the data source, a specialdata structure called the FP-Tree is needed [26]. The tree is used for the data mining part. Finally the algorithm works in two steps: 1. ConstructionoftheFP-Tree 2. Extractfrequentitemsets Construction oftheFP-Tree The FP-Tree is a compressed representation of the input. While reading the data source each transaction t is mapped to a path in the FP-Tree. As different transaction can have several items in common, their path may overlap. With this it is possible to compress the structure. ThebelowfigureshowsanexampleforthegenerationofanFP-treeusing10transactions. Extractfrequent item sets A bottom-up strategy starts with the leaves and moves up to the root using a divide and conquer strategy. Because every transaction is mapped on a path in the FP-Tree, it is possible to mine frequent item sets ending in a particular item. Steps: o OpenWEKATool. o Click onWEKAExplorer. • ClickonPreprocessing tabbutton.
  • 57. InformationTechnology Page55 • Clickonopenfile button. • ChooseWEKAfolderinCdrive. o Selectand Clickondataoptionbutton. o Chooseadatasetand openfile. o ClickonAssociatetabandChooseFP-Growthalgorithm o Clickonstartbutton. Exercise6: ApplyFP-GrowthalgorithmonBloodTransfusionServiceCenterdataset
  • 64. InformationTechnology Page62 Experiment7:ImplementationofDecisionTreeInduction Steps to model decision tree. 1. Doubleclickoncredit-g.arfffile. 2. Considerallthe21attributes formakingdecisiontree. 3. Clickonclassifytab. 4. Clickonchoose button. 5. ExpandtreefolderandselectJ48 6. Clickonusetrainingsetintestoptions. 7. Clickonstartbutton. 8. Rightclickonresultlistandchoosethevisualizetreetogetdecisiontree. WecreatedadecisiontreebyusingJ48Technique forthecompletedataset asthetrainingdata. The following model obtained after training. Exercise7: Applydecision treealgorithmto bookatableinahotel/bookatrainticket/movieticket.
  • 71. InformationTechnology Page69 Experiment8:calculatinginformationgain measures. Information gain(IG)measures how much “information” a feature gives us about the class. – Features that perfectly partition should give maximalinformation. – Unrelated features should give no information. It measuresthe reduction in entropy. CfsSubsetEval aims to identify a subset of attributes that are highly correlated with the target while not being strongly correlated with one another. It searches through the space of possible attribute subsets for the “best” one using the BestFirst search method by default, although other methods can be chosen. To use the wrapper method rather than a filter method, such as CfsSubsetEval, first select WrapperSubsetEval and then configure it by choosing a learning algorithm to apply and setting the number of cross-validation folds to use when evaluating it on each attribute subset. Steps: oOpenWEKATool. oClick onWEKAExplorer. • ClickonPreprocessing tabbutton. • Clickonopenfile button. oSelectand Clickondataoptionbutton. oChooseadatasetand openfile. oClickonselectattributetabandChooseattributeevaluator, searchmethodalgorithm oClickonstartbutton.
  • 79. InformationTechnology Page77 Experiment9:classificationofdatausingBayesianapproach Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are the statistical classifiers. Bayesianclassifierscanpredict class membershipprobabilitiessuchastheprobability that a given tuple belongs to a particular class. Steps: 1. OpenWEKATool. 2. Click onWEKAExplorer. 3. ClickonPreprocessingtabbutton. 4. Clickonopenfile button. 5. ChooseWEKAfolderinCdrive. 6. Selectand Clickondataoption button. 7. Chooseadatasetand openfile. 8. Clickonclassifytaband ChooseNaïve-bayesalgorithmandselectusetraining settestoption.’ 9. Clickonstartbutton.
  • 87. InformationTechnology Page85 Experiment10:classification ofdatausingK-nearest neighborapproach K nearest neighbors is a simple algorithmthat stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). K-nearest neighbors (KNN) algorithm is a type ofsupervised ML algorithm which can be used for both classification as well as regression predictive problems. However, it is mainly used for classification predictive problems in industry. The following two properties would define KNN well −  Lazylearningalgorithm−KNNisalazylearningalgorithmbecauseitdoesnothavea specialized training phase and uses all the data for training while classification.  Non-parametric learningalgorithm−KNN isalso anon-parametric learningalgorithmbecause it doesn’t assume anything about the underlying data. Steps: 1. OpenWEKATool. 2. ClickonWEKAExplorer. 3. ClickonPreprocessingtabbutton. 4. Clickonopenfilebutton. 5. ChooseWEKAfolderinCdrive. 6. Selectand Click ondataoption button. 7. Chooseadatasetandopenfile. 8. ClickonclassifytabandChoosek-nearestneighbor andselectusetrainingsettestoption. 9. Clickonstartbutton. Exercise10: PerformanalysisonirisdatasetandbuildclusterusingK-nearestneighborapproach.
  • 94. InformationTechnology Page92 Experiment 11:implementation ofK-means algorithm Kmeansalgorithm is an iterative algorithm that tries to partition the dataset into Kpre-defined distinctnon-overlappingsubgroups(clusters)whereeachdatapointbelongstoonlyonegroup.It tries to make the inter-cluster datapoints as similar as possible while also keeping the clusters as different (far) as possible. It assigns data points to a cluster such that the sum of the squared distance betweenthe data points and the cluster’scentroid (arithmetic mean ofallthe data points thatbelongtothatcluster)isattheminimum.Thelessvariationwehavewithinclusters,themore homogeneous (similar) the data points are within the same cluster. Steps: 1. OpenWEKATool. 2. ClickonWEKAExplorer. 3. ClickonPreprocessingtabbutton. 4. Clickonopenfilebutton. 5. ChooseWEKAfolderinCdrive. 6. SelectandClickondataoptionbutton. 7. Chooseadatasetandopenfile. 8. ClickonclustertabandChoosek-means algorithm. 9. Clickonstartbutton. Exercise11: ImplementofK-meansclusteringusingcrimedataset.