What is Data Mining? The process of analyzing data and extracting patterns from it. Four main kinds of tasks:
- Classification – arrange records into predefined groups
- Clustering – similar to classification, but the groups are not predefined
- Regression – find a function that models the data with the least error
- Association rule learning – search for relationships between variables
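As a concrete taste of one of these tasks, regression in its simplest form fits a straight line to the data by minimizing squared error. A minimal sketch in plain Python (the data points are invented for illustration):

```python
# Simple linear regression: fit y = a*x + b by minimizing squared error,
# using the closed-form least-squares solution for one input variable.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Toy data lying exactly on the line y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)  # → 2.0 1.0
```

Real data will not fit exactly, of course; the same formula then gives the line with the smallest total squared error.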
Different levels of analysis that are available:
- Artificial neural networks – non-linear predictive models that resemble biological neural networks in structure.
- Genetic algorithms – optimization techniques that use processes such as genetic combination, mutation, and natural selection, in a design based on the concepts of natural evolution.
- Decision trees – provide a set of rules that you can apply to a new dataset to predict the outcome. Examples:
- Classification and Regression Trees (CART)
- Chi-Square Automatic Interaction Detection (CHAID)
CART and CHAID are decision tree techniques used for classification of a dataset.
- Rule induction – the extraction of useful if-then rules from data, based on statistical significance.
- Nearest neighbor – classify records based on the k most similar records.
- Data visualization – visual interpretation of complex relationships in multidimensional data.
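The nearest-neighbor idea above can be sketched in a few lines of Python (the tiny two-feature dataset is made up for illustration):

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training records.
    `train` is a list of ((feature, ...), label) pairs."""
    # Sort training records by Euclidean distance to the query point
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))
    top_labels = [label for _, label in nearest[:k]]
    # Majority vote among the k closest records
    return Counter(top_labels).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]
print(knn_classify(train, (5.1, 5.0), k=3))  # → B
```

The query point sits among the "B" records, so all three of its nearest neighbors vote "B".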
Applications
Can be divided into four major kinds: classification, numerical prediction, association, and clustering. Some examples: automatic abstraction, financial forecasting, targeted marketing, medical diagnosis, credit card fraud detection, weather forecasting, etc.
Introduction to RapidMiner
RapidMiner (formerly YALE*) is an environment for machine learning and data mining experiments. It is used for both research and real-world data mining tasks. Software versions:
Editions range from the Community Edition to commercial versions adding more features, services, and guarantees.
*YALE – Yet Another Learning Environment
Some properties of RapidMiner:
- Written in Java
- Knowledge discovery processes are modelled as operator trees
- Internal XML representation ensures a standardized interchange format for data mining experiments
- Scripting language allows for automating large-scale experiments
- Multi-layered data view concept ensures efficient and transparent data handling
- GUI, command-line mode (batch mode), and Java API for using RapidMiner from other programs
- Several plugins already exist
- A large set of high-dimensional visualization schemes for data and models, offered by its plotting facility
Applications: text mining, multimedia mining, feature engineering, data stream mining and tracking drifting concepts, development of ensemble methods, and distributed data mining.
Use of RapidMiner for Data Mining
Using RapidMiner:
- Process configuration is provided as an XML file
- The GUI can be used to design the XML description of the operator tree
- Breakpoints can be used to check intermediate results
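To make the operator-tree idea concrete, a minimal process description in the style of RapidMiner's classic XML files might look like the sketch below. The operator names, parameters, and file names here are illustrative assumptions, not copied from a real process; the exact tag set depends on the RapidMiner version.

```xml
<operator name="Root" class="Process">
  <!-- Read an example set described by an attribute description file -->
  <operator name="Input" class="ExampleSource">
    <parameter key="attributes" value="iris.aml"/>
  </operator>
  <!-- Learn a model from the examples -->
  <operator name="Learner" class="DecisionTree"/>
  <!-- Write the resulting model to a .mod file -->
  <operator name="Output" class="ModelWriter">
    <parameter key="model_file" value="iris.mod"/>
  </operator>
</operator>
```

Nesting operators inside the root operator is what makes the process a tree; the GUI edits exactly this structure.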
Use from a separate program: the command-line version and the Java API can be used to invoke RapidMiner from your own programs without using the GUI.
Download and Installation Steps
Download
The latest version of RapidMiner can be downloaded from http://rapid-i.com/content/blogsection/7/82/lang,en/ by selecting the appropriate platform (Windows x86, x64, etc.) and RapidMiner edition.
Installation
Windows executable:
- Download the Windows executable (.exe) file
- Double-click the rapidminer-xxx-install.exe file to run it
- Follow the instructions
Java version (any platform):
- Install JRE 1.5 or above (http://www.java.sun.com)
- Choose the installation directory and unzip the downloaded file (.zip, .tar, etc.)
rapidminer-xxx-bin.zip – contains only the binaries
Configuration files are looked up in the following order:
- RapidMiner home directory
- .rapidminer in your home directory
- Current working directory
- File specified by rapidminer.rcfile
Supported File Formats
RapidMiner can read data files, and read and write models, parameter sets, and attribute sets. Most important are examples and instances.
Data files & attribute description files
- ARFFEXAMPLESOURCE – reads the .arff format
- DATABASEEXAMPLESOURCE – reads from databases
- SPARSEFORMATEXAMPLESOURCE
- DENSEFORMATEXAMPLESOURCE
An attribute description file (.aml) is an XML file used to retrieve metadata about the instances. Attributes that can be set:
- name – unique name of the attribute
- sourcefile – name of the file containing the data (a default is used if not specified)
- sourcecol – column within the file (starting from 1)
- sourcecol_end – attributes from sourcecol through sourcecol_end are generated with the same properties
- valuetype – one of nominal, numeric, integer, real, ordered, binominal, polynominal, and file_path
- blocktype – one of single_value, value_series, value_series_start, value_series_end, interval, interval_start, interval_end
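A sketch of what such an attribute description file might contain is shown below. The file name, attribute names, and column layout are made up for illustration, and the exact tag set may differ between RapidMiner versions:

```xml
<attributeset default_source="iris.dat">
  <!-- Regular attributes, read from columns 1 and 2 of the data file -->
  <attribute name="sepal_length" sourcecol="1" valuetype="real"/>
  <attribute name="sepal_width"  sourcecol="2" valuetype="real"/>
  <!-- sourcecol_end generates one attribute per column 3..4 with the same properties -->
  <attribute name="petal"        sourcecol="3" sourcecol_end="4" valuetype="real"/>
  <!-- The label (target) attribute for classification -->
  <label     name="class"        sourcecol="5" valuetype="nominal"/>
</attributeset>
```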
Model files (.mod files) – contain the models generated by previous runs
- MODELWRITER – writes model files
- MODELLOADER – reads model files
- MODELAPPLIER – applies models to example sets
Attribute construction files (.att files)
- ATTRIBUTECONSTRUCTIONWRITER – writes an attribute set
- ATTRIBUTECONSTRUCTIONLOADER – reads an attribute set
Parameter set files (.par files)
- GRIDPARAMETEROPTIMIZATION – generates a set of optimal parameters for a particular task
- PARAMETERSETLOADER – loads the parameter files
Attribute weight files (.wgt files) – attribute selection is seen as attribute weighting, which allows for more flexibility
- ATTRIBUTEWEIGHTSWRITER – writes attribute weights to a file
- ATTRIBUTEWEIGHTSLOADER – reads attribute weights
- ATTRIBUTEWEIGHTSAPPLIER – applies the weights to example sets