Data Mining with SQL Server 2008

What is Data Mining?
Why?
Discovering relationships
Predict future events
Usage scenarios
Algorithms

  • Peter Gfader shows SQL Server
  • Java current version 1.6 Update 21
    1.7 released next year (2010)
    Dynamic languages
    Parallel computing
    Maybe closures
  • 3. Create the following report on top of Northwind
    Top 10 customers (Table)
    Top 10 products (Table)
    Top 10 employees (Table)
    1 chart that shows the top 10 customers
    1 usage of the gauge control (surprise me)
    a. Download Report Builder 2 from http://www.microsoft.com/downloads/en/details.aspx?FamilyID=9f783224-9871-4eea-b1d5-f3140a253db6&displaylang=en
    b. Send me the screenshot of the final report
  • Data mining can be used to uncover patterns in data samples, but it is important to be aware that the use of non-representative samples of data may produce results that are not indicative of the domain. Similarly, data mining will not find patterns that may be present in the domain if those patterns are not present in the sample being "mined". There is a tendency for insufficiently knowledgeable "consumers" of the results to attribute "magical abilities" to data mining, treating the technique as a sort of all-seeing crystal ball. Like any other tool, it only functions in conjunction with the appropriate raw material: in this case, indicative and representative data that the user must first collect. Further, the discovery of a particular pattern in a particular set of data does not necessarily mean that the pattern is representative of the whole population from which that data was drawn. Hence, an important part of the process is the verification and validation of patterns on other samples of data.
  • http://msdn.microsoft.com/en-us/library/ms175595.aspx
    Ways to analyze your data
    DT = split data
    Each branch is like an attribute
    Brightness = amount of data
    TODO: Check out bars
    Clustering = mapping of popular points
    Number of children
    Darkness = Lines are links between clusters (associations)
    Time series
    Time-based data → prediction
    Sequence clustering
    Numbers orders stronger associations
    Direction of association (not necessarily the other direction)
    Association
    If you own certain stocks → you maybe own other ones as well
    Probability = thickness of line
    Naive Bayes
    Bayes formula
    Uses statistics to say whether a case falls into a certain category or not (with probability)
    Spam filtering → score of spam (Bayes)
    Testing only a particular attribute
    Neural Nets
    Let system learn how to classify data
    Formulate statement/hypothesis
    Outcome is known (Data / Surveys)
    1. 70% of data to train network (outcome is known)
    2. 30% of data to test network (outcome is known)
    3. New data (no survey needed, predict from network)
    Ex: OCR
    Example above = get loyalty of customers
    Neural Network adapts to the new data
  • What attributes I am interested in
    Algorithm splits data for me
  • Pruned = trimmed back (German: "gestutzt")
  • Different color = relationship
    User clicked on Toy Story 2
  • Very easy to set up
    Classifies and gives a score → prediction
  • Class label: combination of different attributes
    Name clusters yourself
  • Get loyalty of customers

Transcript

  • 1. SQL Server 2008 for Business Intelligence
    UTS Short Course
  • 2. Peter Gfader
    Specializes in
    C# and .NET (Java not anymore)
    Testing: automated tests
    Agile, Scrum: Certified Scrum Trainer
    Technology aficionado
    Silverlight
    ASP.NET
    Windows Forms
  • 3. Admin Stuff
    Attendance
    You initial sheet
    Hands On Lab
    You get me to initial sheet
    Homework
    Certificate
    At end of 5 sessions
    If I say that you have completed it successfully
  • 4. Course Website
    Course Timetable & Materials
    http://www.ssw.com.au/ssw/Events/2010UTSSQL/
    Resources
    http://sharepoint.ssw.com.au/Training/UTSSQL/
  • 5. Course Overview
  • 6. Last week(s)
    Other cube browsers
    Microsoft Data Analyzer
    ProClarity
    Excel 2003/2007/2010
    Excel services
    Thinslicer
    PerformancePoint
    PowerPivot
  • 7. Create report on top of Northwind
    Top 10 customers (Table)
    Top 10 products (Table)
    Top 10 employees (Table)
    1 chart that shows the top 10 customers
    1 usage of the gauge control (surprise me)
    Homework
  • 8. The plan
  • 9. Step by step to BI
    Create Data Warehouse
    Copy data to data warehouse
    Create OLAP Cubes
    Create Reports
    Browse the cube
    Do some Data Mining
    Discovering relationships
    Predict future events
  • 10. Agenda
    What is Data Mining?
    Why?
    Uses
    Algorithms
    Demo
    Hands on Lab
  • 11. What is Data Mining?
    “Data mining is the use of powerful software tools to discover significant traits or relationships, from databases or data warehouses and often used to predict future events”
  • 12. What is Data Mining?
    It exploits statistical algorithms
    Once the “knowledge” is extracted it:
    Can be used to discover
    Can be used to predict values of other cases
  • 13. Why Data Mining?
    Marketing
    Who picks the movie? The kids, the wife, me
    Who are our Customers and what sort of films do they hire?
    Is a 30-year-old woman with 2 children going to hire Arnie’s latest film?
    Validation
    Is this data sensible? Terminator 2 and Toy Story
    Prediction
    Sales Next Year
  • 14. Get new information from data: future trends, past trends, outliers, maximums, minimums
    Analyse data from different perspectives and summarize it into useful information
    New information to
    increase revenue
    cut costs
    or both :-)
    Why? It's all about money
  • 15. Who are our biggest customers?
    What are customers buying with cigars?
    What are the customer retention levels of our branches?
    Which customers have bought olives, feta cheese but no ciabatta bread?
    Which regions have the highest male/female ratio of single 20 somethings?
    Which region has lowest customer retention levels and list out lost customers?
    Which Questions are Data Mining?
  • 16. Ad hoc query
    Drill through to details
    Business Intelligence tool
    What’s not data mining
  • 17. Huge amount of data
  • 18. Good raw material → good data mining
    Samples should be representative
    Samples "similar" to domain
    Not all-seeing crystal ball
    Verify and Validate!
    Data - Uncover patterns in samples
  • 19. OLAP
    Is about fast ad hoc querying
    Analysis by dimensions and measures
    Gives precise answers
    Data Mining
    May use RDBMS or OLAP source
    Is about discovering and predicting
    Gives imprecise answers
    OLAP is not a prerequisite for data mining, but it almost always comes first
    OLAP versus Data Mining
    (learning to ride a bike before a car)
  • 20. Classification algorithms
    predict one or more discrete variables, based on the other attributes in the dataset
    Regression algorithms
    predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset
    Segmentation algorithms
    divide data into groups, or clusters, of items that have similar properties
    Association algorithms
    find correlations between different attributes in a dataset
    Sequence analysis algorithms
    summarize frequent sequences or episodes in data, such as a Web path flow
    Types of Data Mining Algorithms
  • 21. Clustering
    Time Series
    Decision Trees
    Naïve Bayes
    Association
    Linear Regression
    Complete Set Of Algorithms: Ways to analyze your data
    Neural Network
    Sequence Clustering
    Logistic Regression
  • 22. Split data
    Each branch is like an attribute
    Brightness = amount of data
    Decision trees
  • 23. Decision Trees (1)
    Decision Trees assign (classify) each case to one of a few (discrete) broad categories of a selected attribute (variable) and explain the classification with a few selected input variables
    The process of building is recursive partitioning – splitting data into partitions and then splitting it up more
    Initially all cases are in one big box
  • 24. Decision Trees (2)
    The algorithm tries all possible breaks in classes using all possible values of each input attribute; it then selects the split that partitions data to the purest classes of the searched variable
    Several measures of purity
    Then it repeats splitting for each new class
    Again testing all possible breaks
    Branches of the tree that are not useful can be pre-pruned or post-pruned
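
The split selection described on this slide can be sketched roughly as follows. This is a minimal illustration, not how SQL Server's decision tree algorithm is implemented: it uses Gini impurity as one possible purity measure, tests only "attribute equals value" splits, and the function names and the small rental data set are invented for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 means the group is pure (one class only)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, target):
    """Try every (attribute == value) test and return the purest split.

    rows is a list of dicts; target is the key of the predicted attribute.
    Returns (attribute, value, weighted_impurity).
    """
    best = None
    attributes = [k for k in rows[0] if k != target]
    for attr in attributes:
        for value in sorted({r[attr] for r in rows}):
            left = [r[target] for r in rows if r[attr] == value]
            right = [r[target] for r in rows if r[attr] != value]
            # Weight each side's impurity by the share of cases it holds.
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (attr, value, score)
    return best

# Hypothetical rental records (invented for illustration).
rentals = [
    {"age_group": "young", "has_kids": "yes", "rents": "family"},
    {"age_group": "young", "has_kids": "no", "rents": "action"},
    {"age_group": "older", "has_kids": "yes", "rents": "family"},
    {"age_group": "older", "has_kids": "no", "rents": "action"},
    {"age_group": "young", "has_kids": "yes", "rents": "family"},
]

split = best_split(rentals, "rents")
print(split)
```

A full tree would call `best_split` again on each of the two partitions (the recursive partitioning from the previous slide) and stop, or prune, once further splits no longer improve purity.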
  • 25. Decision Trees (3)
    Decision trees are used for classification and prediction
    Typical questions:
    Predict which customers will leave
    Help in mailing and promotion campaigns
    Explain reasons for a decision
    What are the movies young female customers like to buy?
  • 26. Decision Trees – Who Decides
  • 27. Naïve Bayes
    Bayes Formula
    Uses statistics to say whether a case falls into a certain category or not, with a probability
    Spam filtering: score of spam (Bayes)
    Testing only a particular attribute
  • 28. Naïve Bayes
    Quickly builds mining models that can be used for classification and prediction
    It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute
    This can later be used to predict an outcome of the predicted attribute based on the known input attributes
    This makes the model a good option for exploring the data
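
The probability bookkeeping described above can be sketched in a few lines. This is an illustrative count-based Naïve Bayes, not the SQL Server implementation; the add-one (Laplace) smoothing, the function names, and the tiny spam data set are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train(rows, target):
    """Count how often each input state occurs for each state of the target."""
    class_counts = Counter(r[target] for r in rows)
    # value_counts[(attribute, class)][value] = number of matching cases
    value_counts = defaultdict(Counter)
    states = defaultdict(set)  # all observed states per attribute
    for r in rows:
        for attr, value in r.items():
            if attr != target:
                value_counts[(attr, r[target])][value] += 1
                states[attr].add(value)
    return class_counts, value_counts, states

def predict(model, case):
    """Return {class: probability} for one case of known input attributes."""
    class_counts, value_counts, states = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior P(class)
        for attr, value in case.items():
            # P(value | class) with add-one smoothing (an assumption here)
            count = value_counts[(attr, cls)][value]
            score *= (count + 1) / (n_cls + len(states[attr]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

# Hypothetical e-mails (invented for illustration).
mails = [
    {"has_link": "yes", "mentions_money": "yes", "label": "spam"},
    {"has_link": "yes", "mentions_money": "no", "label": "spam"},
    {"has_link": "no", "mentions_money": "yes", "label": "spam"},
    {"has_link": "no", "mentions_money": "no", "label": "ham"},
    {"has_link": "no", "mentions_money": "no", "label": "ham"},
    {"has_link": "yes", "mentions_money": "no", "label": "ham"},
]

model = train(mails, "label")
probs = predict(model, {"has_link": "yes", "mentions_money": "yes"})
print(probs)
```

The "naïve" part is the multiplication inside `predict`: each input attribute is treated as independent of the others once the class is known.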
  • 29. Cluster Analysis (1)
    Grouping data into clusters
    Objects within a cluster have high similarity based on the attribute values
    The class label of each object is not known
    Several techniques
    Partitioning methods
    Hierarchical methods
    Density based methods
    Model based methods
    And more…
  • 30. Cluster Analysis (2)
    Segments a heterogeneous population into a number of more homogenous subgroups or clusters
    Some typical questions:
    Discover distinct groups of customers
    Identification of groups of houses in a city
    In biology, derive animal and plant taxonomies
    Find outliers
  • 31. Clustering
    Annual
    Income
    Age
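
As a rough sketch of one partitioning method, the code below runs plain k-means on (age, annual income) pairs like the ones plotted on this slide. The customer figures, the choice of starting centroids, and the function name are invented for illustration; real data would normally be scaled first so that income does not dominate the distance.

```python
def kmeans(points, centroids, iterations=20):
    """Plain k-means (Lloyd's algorithm) on tuples of numbers."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # nothing moved, so the clusters are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical customers: (age, annual income in thousands).
customers = [(22, 28), (25, 32), (27, 30), (48, 85), (52, 90), (55, 95)]

# Start from two of the data points; the class label of each case is unknown.
centroids, clusters = kmeans(customers, [customers[0], customers[3]])
print(centroids)
print(clusters)
```

The algorithm only produces unnamed groups; deciding that one cluster means "young, lower income" and the other "older, higher income" is left to the analyst.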
  • 32. Time series
    Time-based data → prediction
  • 33. Sequence clustering
    Numbers orders stronger associations
    Direction of association (not necessarily the other direction)
  • 34. If you own certain stocks → you maybe own other ones as well
    Probability = thickness of line
    Association
  • 35. Let system learn how to classify data
    Neural Network adapts to the new data
    Formulate statement/hypothesis
    Outcome is known
    (Data / Surveys)
    1. 70% data to train network (outcome is known)
    2. 30% of data to test network (outcome is known)
    3. New data (no survey needed, predict from network)
    Other example: OCR
    Neural Nets
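
The 70% / 30% workflow on this slide can be sketched with a deliberately tiny stand-in for a neural network: a single neuron trained with the perceptron rule. Everything here is an assumption for illustration: the synthetic "survey" data, the rule that generates the loyalty label, and the function names. A real neural network has hidden layers and a different training procedure.

```python
import random

def make_survey(n, rng):
    """Generate n hypothetical labelled cases: loyal (1) if x1 + x2 > 1."""
    cases = []
    while len(cases) < n:
        x1, x2 = rng.random(), rng.random()
        if abs(x1 + x2 - 1.0) < 0.1:
            continue  # keep a margin so the classes are cleanly separable
        cases.append(((x1, x2), 1 if x1 + x2 > 1.0 else 0))
    return cases

def predict(weights, x):
    """Single neuron with a step activation; weights[2] is the bias."""
    return 1 if weights[0] * x[0] + weights[1] * x[1] + weights[2] > 0 else 0

def train(cases, max_epochs=2000):
    """Perceptron rule: nudge the weights after every wrong prediction."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in cases:
            error = label - predict(weights, x)
            if error:
                mistakes += 1
                weights[0] += error * x[0]
                weights[1] += error * x[1]
                weights[2] += error
        if mistakes == 0:
            break  # every training case is classified correctly
    return weights

def accuracy(weights, cases):
    """Share of cases whose predicted label matches the known outcome."""
    return sum(predict(weights, x) == label for x, label in cases) / len(cases)

rng = random.Random(42)
survey = make_survey(100, rng)
rng.shuffle(survey)
train_set, test_set = survey[:70], survey[70:]  # 1. 70% train, 2. 30% test

weights = train(train_set)
print("train accuracy:", accuracy(weights, train_set))
print("test accuracy:", accuracy(weights, test_set))

# 3. New data: no survey needed, the outcome is predicted from the model.
print("new case:", predict(weights, (0.9, 0.8)))
```

The test set matters because its outcomes are known but were never shown to the model, so its accuracy is an honest estimate of how the model will do on new data.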
  • 36. Both have directions
    Sequence Clustering has probability number and colour
    They are very similar. The difference is that Association analyses items that occur together whereas sequence clustering analyses items that follow one another.
    An example is that Sequence Clustering might be used by credit card companies to spot fraud, e.g. a petrol station refill followed by another petrol station refill followed by a big purchase = fraud (different transactions)
    Whereas Association will be more like: when someone buys popcorn at the cinemas, they also buy a drink (same transaction)
    Difference between algorithms: Association and Sequence
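
The distinction can be made concrete with two small counting functions: one counts items that occur together in the same transaction, the other counts items that directly follow one another. This is only a sketch of the underlying idea with invented data and function names; the actual algorithms add probabilities, thresholds, and clustering on top of counts like these.

```python
from collections import Counter
from itertools import combinations

def association_pairs(transactions):
    """Count unordered item pairs that occur together in the SAME transaction."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

def sequence_pairs(histories):
    """Count ordered pairs where one event directly FOLLOWS another."""
    counts = Counter()
    for events in histories:
        for first, second in zip(events, events[1:]):
            counts[(first, second)] += 1
    return counts

# Hypothetical cinema baskets: each inner list is one transaction.
baskets = [["popcorn", "drink"], ["popcorn", "drink", "candy"], ["candy"]]

# Hypothetical card histories: each inner list is one card's purchases in time order.
cards = [
    ["petrol", "petrol", "big purchase"],
    ["groceries", "petrol", "groceries"],
    ["petrol", "petrol", "big purchase"],
]

together = association_pairs(baskets)
follows = sequence_pairs(cards)
print(together[("drink", "popcorn")])
print(follows[("petrol", "big purchase")])
print(follows[("big purchase", "petrol")])
```

The pair keys in `together` are sorted because order inside a basket carries no meaning, while the keys in `follows` keep their order, which is why the reversed pair has a different count.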
  • 37. Conclusion: When To Use What
  • 38. Visual Numerics
    3rd party algorithms
    http://www.vni.com/company/whitepapers/MicrosoftBIwithNumericalLibraries.pdf
    There is more...
  • 39. Excel Data Mining
    Microsoft SQL Server 2008 Data Mining Add-ins for Microsoft Office 2007
    http://www.microsoft.com/downloads/en/details.aspx?familyid=896A493A-2502-4795-94AE-E00632BA6DE7&displaylang=en
  • 40. Train station / airport
    Who is the bad guy
    Farmers
    Find the best crops
    Supermarket
    Figure out how to get you to buy more and where to put the expensive items
    Other usages of data mining: Find patterns - Profiling
  • 41. SSIS 2008 - Data profiling task
    Get a profile of the data in a table
    potential candidate keys
    length of data values in columns
    Null percentage of rows
    distribution of values
    ....
    Tip
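
To give a feel for what such a profile contains, the sketch below computes the same kinds of figures for columns held in plain Python lists. It does not use or imitate the SSIS Data Profiling task itself; the function name, the result keys, and the sample table are invented for illustration, and `None` stands in for SQL NULL.

```python
from collections import Counter

def profile_column(values):
    """Summarise one non-empty column; None stands in for SQL NULL."""
    non_null = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in non_null]
    return {
        "null_percentage": 100.0 * (len(values) - len(non_null)) / len(values),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "distribution": Counter(non_null),
        # A column can only be a key candidate if it has no NULLs and no duplicates.
        "candidate_key": len(non_null) == len(values)
        and len(set(non_null)) == len(values),
    }

# Hypothetical table, stored column by column.
table = {
    "customer_id": ["C001", "C002", "C003", "C004"],
    "country": ["Germany", "Mexico", "Mexico", None],
}

profiles = {name: profile_column(col) for name, col in table.items()}
for name, profile in profiles.items():
    print(name, profile)
```

Running a profile like this before building a mining model is a cheap way to spot columns that are mostly empty or that carry only a single value.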
  • 42. Video: Simple data mining model
    http://www.sqlservercentral.com/articles/Video/65055/
    Video: Data mining and Reporting Services
    http://www.sqlservercentral.com/articles/Video/64190/
    Data Mining Algorithms
    http://msdn.microsoft.com/en-us/library/ms175595.aspx
    Resources 1
  • 43. Jamie MacLennan
    http://blogs.msdn.com/b/jamiemac/
    Richard Lees on BI
    http://richardlees.blogspot.com/
    Book Data Mining with Microsoft SQL Server 2008
    http://www.amazon.com/gp/product/0470277742?ie=UTF8&tag=sqlserverda09-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0470277742
    Resources 2
  • 44. Summary
    Why Data Mining?
    Uses
    Algorithms
    Demo
    Hands on Lab
  • 45. 3 things…
    PeterGfader@ssw.com.au
    http://blog.gfader.com/
    twitter.com/peitor
  • 46. Thank You!
    Gateway Court Suite 10 81 - 91 Military Road Neutral Bay, Sydney NSW 2089 AUSTRALIA
    ABN: 21 069 371 900
    Phone: + 61 2 9953 3000 Fax: + 61 2 9953 3105
    info@ssw.com.au www.ssw.com.au