Business intelligence and data warehousing

BUSINESS INTELLIGENCE
AND DATA WAREHOUSING.
Presented by:
Vaishnavi Chigarapalle.

Agenda:
• ID3 Algorithm.
• WEKA.
• Web Mining Applications for Business.
• References.

Overview:
• What is ID3?
• Decision Trees.
• Simple example of Decision Trees.
• ID3 Algorithm.
• Problem.
• Solution to the discussed problem.
• Conclusion.

What is ID3?
• ID3 stands for Iterative Dichotomiser 3.
• It is a mathematical algorithm for building decision trees from a
dataset.
• Invented by J. Ross Quinlan in 1979.
• Uses information theory, introduced by Shannon in 1948.
• The algorithm attempts to create the smallest possible decision tree from
the top down, with no backtracking.
• ID3 is the precursor to the C4.5 algorithm.
• It is typically used in the machine learning and natural language
processing domains.
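
The information-theoretic quantities ID3 relies on are Shannon entropy and
information gain. The notation below is introduced here for illustration and
does not come from the slides: S is a set of examples, p_c is the fraction of
S belonging to class c, and S_v is the subset of S where attribute A takes
the value v.

```latex
H(S) = -\sum_{c} p_c \log_2 p_c
\qquad
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```

At each node, ID3 picks the attribute with the highest gain, that is, the
split that leaves the subsets as pure as possible.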

Decision trees
• The tree consists of decision nodes and leaf nodes.
• A decision node has two or more branches, each representing a value of the
attribute being tested.
• A leaf node produces a homogeneous result (a single class), which does not
require additional classification testing.
• Decision trees are produced by algorithms that identify various ways of
splitting a data set into branch-like segments.
• These segments form an inverted decision tree that originates with a root
node at the top of the tree.
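
One possible way to represent such a tree in code is as nested dictionaries,
with decision nodes as dicts and leaf nodes as plain labels. The tree below
and its attribute names ("outlook", "windy") are hypothetical and chosen only
for illustration.

```python
# Hypothetical tree: decision nodes are dicts, leaf nodes are plain labels.
tree = {
    "outlook": {                                      # root decision node
        "sunny": {"windy": {"no": "+", "yes": "-"}},  # nested decision node
        "rainy": "-",                                 # leaf node
        "overcast": "+",                              # leaf node
    }
}


def classify(node, example):
    """Follow branches from the root until a leaf label is reached."""
    while isinstance(node, dict):
        attribute = next(iter(node))  # the attribute tested at this node
        node = node[attribute][example[attribute]]
    return node


label = classify(tree, {"outlook": "sunny", "windy": "yes"})
print(label)
```

Classification starts at the root and follows one branch per decision node,
so the cost of classifying an example is bounded by the depth of the tree.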

Simple Example of a Decision Tree.

ID3 Algorithm
• Create a root node for the tree.
• If all the examples are positive, return the single-node tree root, with
label "+".
• If all the examples are negative, return the single-node tree root, with
label "-".
• If the set of predicting attributes is empty, return the single-node
tree root, labeled with the most common value of the target attribute.
• Else
A = the attribute that best classifies the examples.
Set the decision tree attribute for root to A.
For each possible value, vi, of A,
 Add a new tree branch below root, corresponding to the test A = vi.

ID3 Algorithm (continued)
 Let Examples(vi) be the subset of examples that have the value vi
for A.
 If Examples(vi) is empty
 Then below this new branch add a leaf node with label equal to the most
common target value in the examples.
 Else below this new branch add the subtree ID3(Examples(vi),
Target_Attribute, Attributes - {A}).
• End
• Return Root.
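
The steps above can be sketched in Python as follows. This is a minimal
illustration rather than a reference implementation: "best classifies" is
taken to mean highest information gain, the function and variable names are
chosen here, and the six-row dataset is made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(examples, attribute, target):
    """Entropy reduction obtained by splitting examples on attribute."""
    before = entropy([e[target] for e in examples])
    after = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after


def id3(examples, target, attributes, domains):
    """Return a leaf label, or a dict {attribute: {value: subtree}}."""
    labels = [e[target] for e in examples]
    most_common = Counter(labels).most_common(1)[0][0]
    # All examples share one label, or no attributes remain: return a leaf.
    if len(set(labels)) == 1 or not attributes:
        return most_common
    # A = the attribute that best classifies the examples.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in domains[best]:
        subset = [e for e in examples if e[best] == value]
        # Empty subset: leaf with the most common label of these examples.
        branches[value] = (id3(subset, target, remaining, domains)
                           if subset else most_common)
    return {best: branches}


# Made-up training data; "play" is the target attribute.
examples = [
    {"outlook": "sunny", "windy": "no", "play": "+"},
    {"outlook": "sunny", "windy": "yes", "play": "-"},
    {"outlook": "rainy", "windy": "no", "play": "-"},
    {"outlook": "rainy", "windy": "yes", "play": "-"},
    {"outlook": "overcast", "windy": "no", "play": "+"},
    {"outlook": "overcast", "windy": "yes", "play": "+"},
]
domains = {"outlook": ["sunny", "rainy", "overcast"], "windy": ["no", "yes"]}
tree = id3(examples, "play", ["outlook", "windy"], domains)
print(tree)
```

The sketch merges the "all positive" and "all negative" checks into a single
"all labels equal" test, which also lets it handle more than two classes.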

Conclusion
• ID3 attempts to build the shortest decision tree from a set of training
data, but the shortest tree is not always the best classifier.
• It requires the training data to have completely consistent patterns with
no uncertainty.

Overview
• What is WEKA?
• WEKA GUI Chooser.
• Data Mining with WEKA.
• Problem.
• Solution for the discussed problem.
• Conclusion

What is WEKA?
• WEKA is an acronym for Waikato Environment for Knowledge Analysis.
• It is a popular suite of machine learning software written in Java.
• It was developed at the University of Waikato, New Zealand.
• WEKA is portable, since it is fully implemented in the Java programming
language and thus runs on almost any modern computing platform.
• WEKA is free software available under the GNU General Public License.
• WEKA's applications:
 Explorer.
 Knowledge Flow.
 Experimenter.
 Simple CLI.

Data Mining With WEKA
Input
• Raw data
Data Mining by WEKA
• Pre-processing
• Classification
• Regression
• Clustering
• Association Rules
• Visualization
Output
• Result

Explorer
• Explorer is WEKA's main user interface.
• The Explorer interface features several panels providing access to the main
components of the workbench:
 Preprocess
 Classify
 Associate
 Cluster
 Select Attributes
 Visualize
• Preprocess Panel: Used to transform the data and to delete instances and
attributes according to specific criteria.
• Classify Panel: Enables users to apply classification and regression
algorithms to the preprocessed dataset and to estimate the accuracy of the
resulting predictive model.

• Associate Panel: This provides access to association rule learners that
attempt to identify all important interrelationships between attributes in the
data.
• Cluster Panel: This gives access to the clustering techniques in WEKA.
• Select Attributes Panel: This panel provides algorithms for identifying the
most predictive attributes in a dataset.
• Visualize Panel: This panel shows a scatter plot matrix, where individual
scatter plots can be selected and enlarged, and analyzed further using
various selection operators.

Experimenter
• It allows the systematic comparison of the predictive performance of
WEKA's machine learning algorithms on a collection of datasets.
• The Experimenter also allows users to set up large-scale experiments, start
them running, leave them, and then analyze the performance statistics that
have been collected.
• It automates the experimental process.
• The statistics can be stored in ARFF format.
• It allows users to distribute the computing load across multiple machines
using Java RMI.
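
ARFF is a plain-text format: a relation name, a list of attribute
declarations, and comma-separated data rows. As a rough sketch of that
layout, the hypothetical helper below renders a small table as ARFF text. The
function name, column names and values are made up for illustration and are
not the Experimenter's actual output schema.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes is a list of (name, type) pairs, where type is either the
    string "numeric" or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        spec = kind if isinstance(kind, str) else "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"


# Placeholder table; the numbers are not real measurements.
text = to_arff(
    "demo",
    [("x", "numeric"), ("cls", ["a", "b"])],
    [(1.5, "a"), (2.0, "b")],
)
print(text)
```

The sketch does not quote names or values containing spaces or commas, which
a real ARFF writer would need to handle.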

Knowledge Flow
• The Knowledge Flow provides an alternative to the Explorer as a graphical
front end to WEKA's core algorithms.
• The Knowledge Flow presents a data-flow inspired interface to WEKA.
• The user can select WEKA components from a tool bar, place them on a
layout canvas and connect them together to form a knowledge flow for
processing and analyzing data.
• Unlike the Explorer, the Knowledge Flow can handle data either
incrementally or in batches.

Simple CLI
• Simple CLI provides a command line mode to access WEKA.

Conclusion
• In sum, the overall goal of WEKA is to build a state-of-the-art facility for
developing machine learning (ML) techniques and to allow people to apply
them to real-world data mining problems.
• Detailed documentation about the functions provided by WEKA can
be found on the WEKA website.

Overview
• What is Web mining?
• Challenges related to web mining.
• Web mining applications.
• Problems with Web search.
• Improvised search – adding structure to the web.
• Conclusion.

What is Web Mining?
• Web mining is the use of data mining techniques to automatically discover
and extract information from web documents and services.
• Its goal is to discover useful information from the World Wide Web and its
usage patterns.
• Web mining can be divided into three types:
 Web usage mining.
 Web content mining.
 Web structure mining.

Challenges related to Web Mining
• The web is a huge collection of documents that also carries:
 Hyperlink information.
 Access and usage information.
• The web is very dynamic: new pages are constantly being generated.
• Challenge: The main challenge is to develop new web mining algorithms
and to adapt traditional data mining algorithms to exploit hyperlinks and
access patterns.

Web Mining Applications
• E-Commerce (Infrastructure)
 Generate User profiles.
 Internet Advertising.
 Fraud.
 Similar Image Retrieval.
• Information retrieval (search) on web
 Automatic generation of topic hierarchies.
 Web Knowledge bases.
 Extraction of schema for XML documents.
• Network Management
 Performance Management.
 Fault Management.

User Profiling.
• Important for improving customization:
 Provides users with pages and advertisements of interest.
 Example profiles: on-line trader, on-line shopper.
• Generate user profiles based on their access patterns:
 Cluster users based on frequently accessed URLs.
 Use a classifier to generate a profile for each cluster.
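
As a rough sketch of the clustering step only, users can be grouped by how
much their sets of visited URLs overlap. The slides do not name a clustering
method. The greedy Jaccard-similarity approach, the 0.5 threshold, and the
user IDs and URLs below are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Overlap between two URL sets: size of intersection over union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_users(visits, threshold=0.5):
    """Greedy single-pass clustering: a user joins the first cluster whose
    seed user is similar enough, otherwise starts a new cluster."""
    clusters = []  # list of (seed_urls, [user, ...])
    for user, urls in visits.items():
        for seed, members in clusters:
            if jaccard(seed, urls) >= threshold:
                members.append(user)
                break
        else:
            clusters.append((urls, [user]))
    return [members for _, members in clusters]


# Made-up access logs: user -> set of frequently accessed URLs.
visits = {
    "u1": {"/quotes", "/portfolio", "/trade"},
    "u2": {"/quotes", "/trade"},
    "u3": {"/cart", "/checkout", "/deals"},
    "u4": {"/cart", "/deals"},
}
clusters = cluster_users(visits)
print(clusters)
```

Here the first cluster resembles the "on-line trader" profile and the second
the "on-line shopper" profile. Assigning such labels would be the job of the
classifier step, which this sketch omits.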

Internet Advertising.
• Scheme 1:
 Manually associate a set of ads with each user profile.
 For each user, display an ad from the set based on profile.
• Scheme 2:
 Automate association between ads and users.
 Use ad click information to cluster users.
 For each cluster, find the ads that are clicked most frequently in the
cluster; these become the ads for the users in that cluster.
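
The last step of Scheme 2 can be sketched as a frequency count per cluster.
The sketch assumes the clusters have already been computed. The function
name, the top_n parameter, and the click data are hypothetical.

```python
from collections import Counter


def ads_for_clusters(clusters, clicks, top_n=2):
    """For each cluster, return the top_n ads clicked most often by its users."""
    result = []
    for members in clusters:
        counts = Counter(ad for user in members for ad in clicks.get(user, []))
        result.append([ad for ad, _ in counts.most_common(top_n)])
    return result


# Made-up data: user clusters and the ads each user clicked.
clusters = [["u1", "u2"], ["u3"]]
clicks = {
    "u1": ["broker", "broker", "loans"],
    "u2": ["broker"],
    "u3": ["shoes", "shoes", "bags"],
}
ads = ads_for_clusters(clusters, clicks)
print(ads)
```

A new user assigned to a cluster would then be shown that cluster's ads.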

Fraud
• With the growing popularity of E-commerce, systems to detect and prevent
fraud on the web become important.
• Maintain a signature for each user based on buying patterns on the web.
• If the buying pattern changes significantly, signal possible fraud.
• HNC Software uses domain knowledge and neural networks for credit card
fraud detection.
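
A very simple form of the signature idea can be sketched with basic
statistics: summarize a user's past purchase amounts by their mean and
standard deviation, and flag a new purchase that lies far outside that range.
This is a simplified stand-in chosen for illustration, not the neural-network
approach mentioned above. The threshold of 3 standard deviations and the
purchase history are assumptions.

```python
from statistics import mean, stdev


def is_suspicious(history, amount, z_threshold=3.0):
    """Flag a purchase whose amount deviates strongly from the user's history."""
    if len(history) < 2:
        return False  # not enough data to form a signature
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return amount != mu  # any change from a constant history stands out
    return abs(amount - mu) / sigma > z_threshold


# Made-up purchase history for one user.
history = [20, 25, 22, 30, 24]
print(is_suspicious(history, 26))
print(is_suspicious(history, 500))
```

A practical signature would cover more than amounts, for example merchant
categories, purchase times and locations.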

Image Retrieval System
• Given:
 A set of images
• Find:
 All images similar to a given image.
 All pairs of similar images.
• A few applications of image retrieval systems:
 Medical diagnosis.
 Weather Prediction
 Web search engine for images.
 E-commerce.

Problems with Web Search
• Today's search engines are plagued by many problems, including:
 The "abundance" problem.
 "Limited coverage" of the web.
(The largest crawlers cover less than 18% of all web pages.)
 "Limited query" interface based on keyword-oriented search.
 "Limited customization" to individual users.
 The web is "highly dynamic".

Improvised searching – Adding structure to the web

Conclusion
• Web mining systems need to be implemented to:
 Understand visitors' profiles.
 Identify the company's strengths and weaknesses.
 Measure the effectiveness of online marketing efforts.
• Web mining supports ongoing, continuous improvement for e-businesses.

References
• http://www.slideshare.net/dataminingtools/WEKA-the-experimenter
• http://www.cs.waikato.ac.nz/ml/WEKA/arff.html
• http://en.wikipedia.org/wiki/WEKA_(machine_learning)
• http://www.cs.umd.edu/Grad/scholarlypapers/papers/Bahety.pdf
• http://software.ucv.ro/~eganea/AIR/KnowledgeFlowTutorial-3-5-8.pdf
