Data Mining Tools
Kowshik
Madhumati
Mayur
Mohamed Sharique
Vidyashankar
Orange
Product Details
• Open source
• Data visualization and analysis
• For novices and experts
• Scriptable through Python
• Available for all popular platforms, including Windows, Mac OS X and variants of Linux
Company Details
• Founded in 1996
• Orange is distributed free under the GPL
• Maintained and developed at the Bioinformatics Laboratory of the Faculty of Computer and Information Science, University of Ljubljana, Slovenia
Python is a widely used general-purpose, high-level programming language.
The GNU General Public License is the most widely used free software license.
Features
• Visual Programming
• Visualization
• Interaction and Data Analytics
• Large Toolbox
• Scripting Interface
• Extendable
• Documentation
• Open Source
• Platform Independence
Success Stories
• AstraZeneca, a pharmaceutical giant, uses Orange in drug development and sponsors the development of several related parts of Orange
• At the Jožef Stefan Institute, the visual programming interface has been upgraded in Orange4WS to support service-oriented architectures
Screenshot
R
Product Details
• Latest R-language engine for statistical computing
• Open source, R-Enterprise, R-Cloud (paid version)
• Data visualization and analysis up to 16 TB
• Extended capabilities with reproducible R toolkits
• Windows, Mac OS and variants of Linux
Company Details
• Created in 1993 in New Zealand
• Ross Ihaka and Robert Gentleman pioneered R language development
• R is licensed under the GNU General Public License
• Many large multinational companies use R
Useful Functions
• Graphics Visualization
• Spatial Data Analysis
• Clustering
• Text Mining
• Social Network Analysis and Graph mining
• Statistics
• Graphics
• Data Manipulation
Success Stories
• Bank of America
• Bing
• Facebook
• Ford
• Google
Screenshot
WEKA
Product Details
• Open source
• A collection of machine learning algorithms
• Data visualization and analysis
• Java-based platform
• Widely used by researchers and practitioners
Company Details
• Founded in 1997
• University of Waikato
The GNU General Public License is the most widely used free software license.
Features
• General Public License
• GUI for interacting with data files
• Explorer is the main user interface of WEKA
• Primitive tasks including data pre-processing, classification, regression, clustering, association rules and visualization
• Works with data files in multiple formats
• One exceptional feature of WEKA is the database connection using JDBC with any RDBMS package
Success Stories
• The Weka mailing list has over 1100 subscribers in 50 countries, including subscribers from many major companies such as Rechtsportal
Screenshot
RapidMiner
Product Details
• Open source
• Data visualization and analysis
• Machine learning
• Data mining, text mining
• Business intelligence
• Runs on the Java runtime
• Available on all major operating systems and platforms
Company Details
• Started as YALE in 2001 by Ralf Klinkenberg, Ingo Mierswa, and Simon Fischer
• Developed since 2006 by Rapid-I, the company founded by Ralf Klinkenberg and Ingo Mierswa, and renamed RapidMiner
• Licensed under the AGPL
Features
• A visual, code-free environment, so no programming is needed
• Design of analysis processes
• Predictive analytics (with pre-made templates)
• Data loading
• Data transformation
• Data modelling
• Data visualization (with lots of visualizations)
• Works with different types and sizes of data sources
• Platform independence
• Acts as a powerful scripting language engine along with a graphical user interface
• Modular operator concept
• Multi-layered data view
Success Stories
• Cisco
• PayPal
• eBay
• Miele
• Volkswagen
Screenshot
Overall Comparison

| Procedure | R-Programming | RapidMiner | Weka | Orange |
|---|---|---|---|---|
| Partitioning of dataset into training and testing sets | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) |
| Descriptor scaling | Pass | Pass | Fail (cannot save parameters for scaling to apply to future datasets) | Fail (no scaling methods) |
| Descriptor selection | Fail (no wrapper methods) | Pass | Pass (but is not part of KnowledgeFlow) | Fail (no wrapper methods) |
| Parameter optimization of machine learning/statistical methods | Fail (not automatic) | Pass | Fail (not automatic) | Fail (not automatic) |
| Model validation using cross-validation and/or independent validation set | Pass (but limited error measurement methods) | Pass | Pass (but cannot save model so have to rebuild model for every future dataset) | Pass (but cannot save model so have to rebuild model for every future dataset) |
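
Three of the procedures compared here (partitioning, descriptor scaling with reusable parameters, and cross-validation) can be illustrated with a minimal sketch in plain Python. This is not how R, RapidMiner, Weka or Orange implement these steps. The function names (`train_test_split`, `fit_scaler`, `apply_scaler`, `k_fold_indices`) and the toy data are hypothetical and chosen only for illustration.

```python
import json
import random
from statistics import mean, pstdev


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(rows):
    """Learn per-column mean and standard deviation from the training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 so a constant column does not cause division by zero.
    stdevs = [pstdev(col) or 1.0 for col in columns]
    return means, stdevs


def apply_scaler(rows, scaler):
    """Standardize rows with previously learned parameters."""
    means, stdevs = scaler
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]


def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; each row is held out exactly once."""
    for fold in range(k):
        test_idx = list(range(fold, n_rows, k))
        held_out = set(test_idx)
        train_idx = [i for i in range(n_rows) if i not in held_out]
        yield train_idx, test_idx


# Toy dataset (hypothetical): 20 rows with two numeric descriptors.
data = [[float(i), float(i * i)] for i in range(20)]
train, test = train_test_split(data, test_fraction=0.25, seed=42)

# Fit the scaling parameters on the training set only, then save them.
scaler = fit_scaler(train)
train_scaled = apply_scaler(train, scaler)
saved = json.dumps(scaler)

# Later, the saved parameters can be reloaded and applied to a new dataset.
restored = json.loads(saved)
test_scaled = apply_scaler(test, restored)
```

Keeping the scaling parameters separate from the data is what makes it possible to apply the same transformation to future datasets, which is the capability the "Descriptor scaling" row checks for.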

Editor's Notes

  • #10 WEKA contains a GUI for interacting with data files and producing visual results.
  • #11 Explorer has several panels providing access to the main components of the workbench:
    • The Preprocess panel imports data from a database, a CSV file, etc., and preprocesses it using a filtering algorithm. Such filters can transform the data and delete instances and attributes according to specific criteria.
    • The Classify panel applies classification and regression algorithms to the dataset, estimates the accuracy of the resulting predictive model, and visualises erroneous predictions, ROC curves or the model itself.
    • The Associate panel provides access to association rule learning, which identifies interrelationships between attributes in the data.
    • The Cluster panel provides access to clustering techniques, including the simple k-means algorithm and many others.
    • The Select attributes panel provides access to algorithms that identify the most predictive attributes in a dataset.
    • The Visualize panel shows a scatter plot matrix in which individual scatter plots can be selected, enlarged and analysed using various selection operators.
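
The simple k-means algorithm mentioned for the Cluster panel can be sketched in a few lines of plain Python. This is an illustrative version of the standard assign-and-update loop (Lloyd's algorithm), not WEKA's own implementation. The function name `k_means`, its parameters, and the sample points are assumptions made for this example.

```python
import math
import random


def k_means(points, k, iterations=100, seed=0):
    """Cluster points with Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k distinct data points picked at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Converged: no centroid moved.
        centroids = new_centroids
    return centroids, labels


# Two visibly separated groups of 2-D points (hypothetical data).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = k_means(points, k=2)
```

On data like this, the first three points end up sharing one label and the last three the other. Which group receives label 0 depends on the random initialisation.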