MS SQL SERVER: Microsoft sequence clustering and association rules
Presentation Transcript

  • Microsoft Sequence Clustering and Association Rules
  • OVERVIEW
    Introduction
    DMX Queries
    Interpreting the sequence clustering model
    Microsoft Sequence Clustering Algorithm Principles and Parameters
    Markov chain model
    Introduction to Microsoft Association Rules
    Association Algorithm Principles and Parameters
  • Microsoft Sequence Clustering and Association Rules
    The Microsoft Sequence Clustering algorithm is a sequence analysis algorithm provided by Microsoft SQL Server Analysis Services.
    The algorithm finds the most common sequences and groups, or clusters, sequences that are similar.
    Example: data that describes the click paths that are created when users navigate or browse a Web site.
    Example: data that describes the order in which a customer adds items to a shopping cart at an online retailer.
  • DMX Queries
    By querying the data mining schema rowset, you can find various kinds of information about the model, such as:
    Basic metadata
    The date and time that the model was created and last processed
    The name of the mining structure that the model is based on
    The column used as the predictable attribute
  • DMX Queries
    SELECT MINING_PARAMETERS
    FROM $system.DMSCHEMA_MINING_MODELS
    WHERE MODEL_NAME = 'Sequence Clustering'
    This query returns the parameters that were used to build and train the sample model.
  • DMX Queries
    SELECT FLATTENED NODE_UNIQUE_NAME,
    (SELECT ATTRIBUTE_VALUE AS [Product 1],
    [Support] AS [Sequence Support],
    [Probability] AS [Sequence Probability]
    FROM NODE_DISTRIBUTION) AS t
    FROM [Sequence Clustering].CONTENT
    WHERE NODE_TYPE = 13
    AND [PARENT_UNIQUE_NAME] = 0
    Getting a List of Sequences for a State
    This query returns the complete list of first states in the model, before the sequences are grouped into clusters.
    It returns the list of sequences (NODE_TYPE = 13) that have the model root node as parent (PARENT_UNIQUE_NAME = 0).
    The FLATTENED keyword makes the results easier to read.
    A sample result of this query is shown in the next figure.
  • DMX Queries
    You reference the value returned for NODE_UNIQUE_NAME to get the ID of the node that contains all sequences for the model.
    You pass this value to the query as the ID of the parent node to get only the transitions included in this node, which happens to contain a list of all sequences for the model.
  • Interpreting the sequence clustering model
    A sequence clustering model has a single parent node that represents the model and its metadata.
    The parent node, which is labeled (All), has a related sequence node that lists all the transitions that were detected in the training data.
    The algorithm also creates a number of clusters, based on the transitions that were found in the data and any other input attributes included when creating the model.
    Each cluster contains its own sequence node that lists only the transitions that were used in generating that specific cluster.
  • Interpreting the sequence clustering model
  • Microsoft Sequence Clustering Algorithm Principles
    The Microsoft Sequence Clustering algorithm is a hybrid algorithm that combines clustering techniques with Markov chain analysis to identify clusters and their sequences.
    The algorithm works on sequence data, which typically represents a series of events or transitions between states in a dataset.
    The algorithm examines all transition probabilities and measures the differences, or distances, between all the possible sequences in the dataset to determine which sequences are the best to use as inputs for clustering.
    After the algorithm has created the list of candidate sequences, it uses the sequence information as an input for the EM method of clustering.
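The Markov chain part of this description can be illustrated with a minimal Python sketch. It only estimates first-order transition probabilities from a handful of sequences; it is not the Analysis Services implementation and it leaves out the EM clustering step entirely. The function name `transition_probabilities` and the click paths are invented for illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order transition probabilities from observed sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each adjacent pair (previous state -> next state).
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1

    probs = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        # Normalise the counts so each row is a distribution over next states.
        probs[prev] = {state: n / total for state, n in nexts.items()}
    return probs


# Hypothetical click paths, invented for illustration only.
click_paths = [
    ["Home", "Products", "Cart"],
    ["Home", "Products", "Home"],
    ["Home", "Search", "Products", "Cart"],
    ["Home", "Products", "Cart"],
]

matrix = transition_probabilities(click_paths)
for state, row in matrix.items():
    print(state, row)
```

Each row of `matrix` sums to 1, which is the property a transition matrix needs before sequences can be compared or clustered.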
  • Markov chain model
    In addition to a set of states, a Markov chain contains a matrix of transition probabilities.
    The transitions emanating from a given state define a distribution over the possible next states.
    The equation $P(x_i = G \mid x_{i-1} = A) = 0.15$ means that, given the current state A, the probability of the next state being G is 0.15.
  • Markov chain model
    Based on the Markov chain, for any sequence $x = \{x_1, x_2, x_3, \ldots, x_L\}$ of length $L$,
    you can calculate the probability of the sequence as follows:
    $$P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_L \mid x_{L-1}, \ldots, x_1)\, P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_1)$$
    In a first-order Markov chain, the probability of each state $x_i$ depends only on the previous state $x_{i-1}$:
    $$P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1)$$
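As a rough illustration of the first-order formula, the following Python sketch multiplies the probability of the first state by one transition probability per step. Only the value P(G|A) = 0.15 comes from the slides; the initial distribution, the remaining transition values, and the function name `sequence_probability` are assumptions made up for the example.

```python
def sequence_probability(sequence, initial, transitions):
    """P(x) = P(x1) * product of P(x_i | x_{i-1}) for a first-order chain."""
    if not sequence:
        raise ValueError("sequence must contain at least one state")
    # Probability of the first state; unseen states get probability 0.
    prob = initial.get(sequence[0], 0.0)
    # One factor P(next | prev) for every adjacent pair of states.
    for prev, nxt in zip(sequence, sequence[1:]):
        prob *= transitions.get(prev, {}).get(nxt, 0.0)
    return prob


# Assumed values for illustration; only P(G|A) = 0.15 appears in the slides.
initial = {"A": 0.5, "G": 0.5}
transitions = {
    "A": {"A": 0.85, "G": 0.15},
    "G": {"A": 0.40, "G": 0.60},
}

# P(A, G, G) = P(A) * P(G|A) * P(G|G) = 0.5 * 0.15 * 0.60
p = sequence_probability(["A", "G", "G"], initial, transitions)
print(p)
```

A transition that never appears in the table makes the whole product 0, which is why practical implementations usually smooth the estimates; that refinement is left out here.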
  • Microsoft Sequence Clustering Parameters
    • CLUSTER_COUNT specifies the approximate number of clusters to be built by the algorithm.
    Setting the CLUSTER_COUNT parameter to 0 causes the algorithm to use heuristics to best determine the number of clusters to build.
    The default is 10.
    • MAXIMUM_STATES specifies the maximum number of states for a non-sequence attribute that the algorithm supports.
    The default is 100.
  • Microsoft Sequence Clustering Parameters
    • MINIMUM_SUPPORT specifies the minimum number of cases that is required in support of an attribute to create a cluster.
    The default is 10.
    • MAXIMUM_SEQUENCE_STATES specifies the maximum number of states that a sequence can have.
    The default is 64.
  • Introduction to Microsoft Association Rules
    The Microsoft Association Rules Viewer in Microsoft SQL Server Analysis Services displays mining models that are built with the Microsoft Association algorithm.
    The Microsoft Association algorithm is an association algorithm provided by Analysis Services that is useful for recommendation engines.
    A recommendation engine recommends products to customers based on items they have already bought, or in which they have indicated an interest.
    The Microsoft Association algorithm is also useful for market basket analysis.
  • Structure of an Association Model
    The top level has a single node (Model Root) that represents the model.
    The second level contains nodes that represent qualified item sets and rules.
  • Association Algorithm Principles
    The Microsoft Association Rules algorithm belongs to the Apriori association family.
    The two steps in the Microsoft Association Rules algorithm are:
    • The first step, a calculation-intensive phase, is to find frequent itemsets.
    • The second step is to generate association rules based on the frequent itemsets.
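The two steps can be sketched in Python as a small, textbook-style Apriori. This is an illustrative sketch and not the Analysis Services implementation; the basket data, the threshold values, and the function names `frequent_itemsets` and `association_rules` are all made up for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Step 1: level-wise search for itemsets with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    # Start with all candidate itemsets of size 1.
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    size = 1
    while candidates:
        level = {}
        for cand in candidates:
            support = sum(1 for b in baskets if cand <= b) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        size += 1
        # Join frequent itemsets of this level into candidates one item larger.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Apriori pruning: every subset one item smaller must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


def association_rules(frequent, min_probability):
    """Step 2: split each frequent itemset into a rule left -> right."""
    rules = []
    for itemset, support in frequent.items():
        for k in range(1, len(itemset)):
            for left in combinations(sorted(itemset), k):
                left = frozenset(left)
                # Rule probability = support(itemset) / support(left side).
                probability = support / frequent[left]
                if probability >= min_probability:
                    rules.append((left, itemset - left, probability))
    return rules


# Hypothetical market baskets, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

itemsets = frequent_itemsets(baskets, min_support=0.5)
rules = association_rules(itemsets, min_probability=0.7)
for left, right, probability in rules:
    print(sorted(left), "->", sorted(right), probability)
```

The first function is where most of the work happens, because every candidate has to be counted against every basket; the pruning step keeps the candidate set small by discarding any itemset that has an infrequent subset.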
  • Association Algorithm Parameters
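    MINIMUM_SUPPORT is the minimum support that an itemset must have to qualify as frequent.
    A value between 0 and 1 is interpreted as a fraction of the total cases.
    The default value is 0.03.
    MAXIMUM_SUPPORT is the maximum support that a frequent itemset can have; it can be used to filter out items that are too frequent to be meaningful.
    A value between 0 and 1 is interpreted as a fraction of the total cases.
    The default value is 1.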
  • Association Algorithm Parameters
    MINIMUM_PROBABILITY is a threshold parameter.
    It defines the minimum probability for an association rule.
    Its value is within the range of 0 to 1.
    The default value is 0.4.
    MINIMUM_IMPORTANCE is a threshold parameter for association rules.
    Rules with importance less than MINIMUM_IMPORTANCE are filtered out.
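To make the two thresholds concrete, here is a hedged Python sketch that scores one rule and applies both filters. The slides do not define importance; the sketch assumes the log-ratio form log10(P(B|A) / P(B|not A)) commonly given for this algorithm, without any smoothing. The basket data, the helper names, and the importance threshold of 0.0 are invented for illustration.

```python
import math


def conditional_probability(baskets, given, target, present=True):
    """P(target in basket | `given` is present), or absent if present=False."""
    pool = [b for b in baskets if (given <= b) == present]
    if not pool:
        raise ValueError("no baskets match the condition")
    return sum(1 for b in pool if target <= b) / len(pool)


def rule_importance(baskets, left, right):
    """Assumed definition: log10(P(right | left) / P(right | not left))."""
    p_with = conditional_probability(baskets, left, right, present=True)
    p_without = conditional_probability(baskets, left, right, present=False)
    if p_with == 0 or p_without == 0:
        raise ValueError("importance is undefined for a zero probability")
    return math.log10(p_with / p_without)


# Hypothetical market baskets, invented for illustration only.
baskets = [frozenset(b) for b in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"milk", "butter"},
)]

left, right = frozenset({"butter"}), frozenset({"bread"})
probability = conditional_probability(baskets, left, right)
importance = rule_importance(baskets, left, right)

MINIMUM_PROBABILITY = 0.4  # default quoted in the slides
MINIMUM_IMPORTANCE = 0.0   # hypothetical threshold chosen for this example
keep_rule = probability >= MINIMUM_PROBABILITY and importance >= MINIMUM_IMPORTANCE
print(probability, importance, keep_rule)
```

Under this assumed definition, a positive importance means the right-hand side is more likely when the left-hand side is present than when it is absent, and a negative value means the opposite.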
  • Association Algorithm Parameters
    MAXIMUM_ITEMSET_SIZE specifies the maximum size of an itemset.
    A value of 0 means that there is no size limit on the itemset. The default value is 3.
    MINIMUM_ITEMSET_SIZE specifies the minimum size of the itemset.
    The default value is 1.
    MAXIMUM_ITEMSET_COUNT defines the maximum number of itemsets.
  • Association Algorithm Parameters
    OPTIMIZED_PREDICTION_COUNT defines the number of items to be cached to optimize predictions.
    AUTODETECT_MINIMUM_SUPPORT represents the sensitivity of the algorithm used to autodetect minimum support.
    To automatically detect the smallest appropriate value of minimum support, set this value to 1.0.
    To turn off autodetection, set this value to 0.
  • Summary
    Introduction to sequence clustering
    DMX Queries
    The sequence clustering model
    Microsoft Sequence Clustering Algorithm Principles and Parameters
    Markov chain model
    Introduction to Microsoft Association Rules
    Association Algorithm Principles and Parameters
  • Visit more self help tutorials
    Pick a tutorial of your choice and browse through it at your own pace.
    The tutorials section is free, self-guiding and will not involve any additional support.
    Visit us at www.dataminingtools.net