  • But even better ones needed—active research areas
  • Transcript

    • 1. Data Stream Mining Applications: Toward Inductive DSMS. CS240B Notes by Carlo Zaniolo, UCLA Computer Science Department, Spring 2008
    • 2. Data Stream Mining and DSMS
      • Mining data streams: an emerging area with important applications
      • Many fast and light algorithms have been developed for mining data streams: Ensembles, Moment, SWIM, etc.
      • Deploying these algorithms on data streams is a challenge
        • Must deal with bursty arrivals, synopses, QoS, scheduling
      • Analysts want to focus on high-level mining tasks, leaving such lower-level issues to the DSMS
      • Integration of mining methods and DSMS technology is needed, but it faces difficult research challenges:
        • Data mining: a big problem for SQL-based DBMS
    • 3. Road Map for Next Three Weeks
      • Data Mining query languages and systems
        • The Inductive DBMS dream and the reality:
          • Oracle, IBM DB2, MS DMX, Weka
      • Fast & Light Algorithms for Mining Data Streams
        • Classifiers and Classifier Ensembles
        • Clustering methods
        • Association Rules
        • Time series
      • Supporting these Algorithms in a DSMS
        • Data Mining Query Languages and support for the mining process
    • 4. The DM Experience for DBMS: from dreams to reality
      • Initial attempts to support mining queries in relational DBMSs were unsuccessful
        • OR-DBMSs do not fare much better [Sarawagi '98].
      • In 1996, a 'high-road' approach was proposed by Imielinski & Mannila, who called for a quantum leap in functionality based on:
        • High-level declarative languages for Data Mining (DM)
        • Technology breakthroughs in DM query optimization.
      • The research area of Inductive DBMS was thus born
        • Inspiring significant work: DMQL, Mine Rule, MSQL, …
          • These suffer from limited generality and performance issues.
    • 5. DB2 Intelligent Miner
      • Model creation
      • Training:
        • CALL IDMMX.DM_buildClasModelCmd('IDMMX.CLASTASKS',
        • 'TASK', 'ID', 'HeartClasTask',
        • 'MODEL', 'MODELNAME', 'HeartClasModel' );
      • Prediction
      • Stored procedures and virtual mining views
      • Outside the DBMS (like Cache Mining)
        • Data transfer delays
    • 7. Oracle Data Miner
      • Algorithms
        • Adaptive Naïve Bayes
        • SVM regression
        • K-means clustering
        • Association rules, text mining, etc.
      • PL/SQL with extensions for mining
      • Models as first class objects
        • Create_Model, Prediction, Prediction_Cost, Prediction_Details, etc.
    • 8. OLE DB for DM (DMX)
      • Model creation
        • Create mining model MemCard_Pred (
        • CustomerId long key, Age long continuous,
        • Profession text discrete,
        • Income long continuous,
        • Risk text discrete predict)
        • Using Microsoft_Decision_Trees;
      • Training
        • Insert into MemCard_Pred OpenRowSet(
        • 'sqloledb', 'sa', 'mypass',
        • 'SELECT CustomerId, Age,
        • Profession, Income, Risk from Customers')
      • Prediction Join
        • Select C.Id, C.Risk, PredictProbability(MemCard_Pred.Risk)
        • From MemCard_Pred AS MP Prediction Join Customers AS C
        • Where MP.Profession = C.Profession and MP.Income = C.Income
        • AND MP.Age = C.Age;
    • 9. Defining a Mining Model
      • Define
        • The format of “training cases” (top-level entity)
        • Attributes, Input/output type, distribution
        • Algorithms and parameters
      • Example
        • CREATE MINING MODEL CollegePlanModel
        • ( StudentID LONG KEY,
          • Gender TEXT DISCRETE,
          • ParentIncome LONG NORMAL CONTINUOUS,
          • Encouragement TEXT DISCRETE,
          • CollegePlans TEXT DISCRETE PREDICT
        • ) USING Microsoft_Decision_Trees
    • 10.
      • INSERT INTO CollegePlanModel
      • (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
      • OPENROWSET('<provider>', '<connection>',
      • 'SELECT StudentID,
      • Gender,
      • ParentIncome,
      • Encouragement,
      • CollegePlans
      • FROM CollegePlansTrainData')
    • 11.
      • SELECT t.ID, CPModel.Plan
      • FROM CPModel PREDICTION JOIN NewStudents AS t
      • ON CPModel.Gender = t.Gender AND
      • CPModel.IQ = t.IQ
      • Figure: Prediction Join of CPModel (ID, Gender, IQ, Plan) with NewStudents (ID, Gender, IQ)
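The define, train, and predict steps on slides 9 to 11 can be sketched outside any DBMS. The Python below is only an illustration of what a prediction join computes: each new row is matched to the model on its input columns and paired with a predicted value. The majority-vote "model" is a deliberately simple stand-in, not the Microsoft_Decision_Trees algorithm, and the names `train`, `prediction_join`, and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def train(cases, inputs, target):
    """Count target labels for each combination of input values (stand-in model)."""
    by_key = defaultdict(Counter)
    overall = Counter()
    for case in cases:
        key = tuple(case[col] for col in inputs)
        by_key[key][case[target]] += 1
        overall[case[target]] += 1
    # Fall back to the globally most frequent label for unseen combinations.
    return {"inputs": inputs, "by_key": by_key,
            "default": overall.most_common(1)[0][0]}

def prediction_join(model, rows, id_col):
    """Pair each new row's id with the model's prediction for its input values."""
    out = []
    for row in rows:
        key = tuple(row[col] for col in model["inputs"])
        counts = model["by_key"].get(key)
        label = counts.most_common(1)[0][0] if counts else model["default"]
        out.append((row[id_col], label))
    return out

# Hypothetical training cases and new rows, shaped like the slide 11 tables.
training = [
    {"ID": 1, "Gender": "F", "IQ": "high", "Plan": "yes"},
    {"ID": 2, "Gender": "F", "IQ": "high", "Plan": "yes"},
    {"ID": 3, "Gender": "M", "IQ": "low", "Plan": "no"},
    {"ID": 4, "Gender": "M", "IQ": "high", "Plan": "yes"},
]
new_students = [
    {"ID": 10, "Gender": "M", "IQ": "low"},
    {"ID": 11, "Gender": "F", "IQ": "low"},  # combination never seen in training
]

cp_model = train(training, inputs=["Gender", "IQ"], target="Plan")
predictions = prediction_join(cp_model, new_students, id_col="ID")
print(predictions)
```

The point of the sketch is the shape of the operation: the join condition names the input columns, and the output column comes from the model rather than from a stored table.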
    • 12. OLE DB for DM (DMX) (cont.)
      • Mining objects as first class objects
        • Schema rowsets
          • Mining_Models
          • Mining_Model_Content
          • Mining_Functions
      • Other features
        • Column value distribution
        • Nested cases
    • 13. Summary of Vendors’ Approaches
      • Built-in library of mining methods
        • Script language or GUI tools
      • Limitations
        • Closed systems (internals hidden from users)
        • Adding new algorithms or customizing old ones is difficult
        • Poor integration with SQL
        • Limited interoperability across DBMSs
      • Predictive Model Markup Language (PMML) as a palliative
    • 14. PMML
      • Predictive Model Markup Language
        • XML-based language for vendor-independent definition of statistical and data mining models
        • Share models among PMML-compliant products
        • A descriptive language
      • Supported by all major vendors
    • 15. PMML Example
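As a rough illustration of the descriptive, XML-based style described on slide 14, the sketch below embeds a small PMML-style document and reads its data dictionary with Python's standard library. The document is abbreviated and hypothetical: it borrows field names from the CollegePlanModel slide, the split on ParentIncome is invented, and it has not been validated against the PMML schema.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical PMML-style document (not schema-validated).
PMML_DOC = """<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header description="Illustrative tree model (hypothetical)"/>
  <DataDictionary numberOfFields="3">
    <DataField name="Gender" optype="categorical" dataType="string"/>
    <DataField name="ParentIncome" optype="continuous" dataType="double"/>
    <DataField name="CollegePlans" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="CollegePlanModel" functionName="classification">
    <MiningSchema>
      <MiningField name="Gender"/>
      <MiningField name="ParentIncome"/>
      <MiningField name="CollegePlans" usageType="target"/>
    </MiningSchema>
    <Node score="no">
      <True/>
      <Node score="yes">
        <SimplePredicate field="ParentIncome" operator="greaterThan" value="50000"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""

NS = {"p": "http://www.dmg.org/PMML-4_4"}
root = ET.fromstring(PMML_DOC)

# Field name -> measurement type, as declared in the data dictionary.
fields = {f.get("name"): f.get("optype")
          for f in root.findall("p:DataDictionary/p:DataField", NS)}

# Fields the model is declared to predict.
targets = [m.get("name") for m in root.findall(".//p:MiningField", NS)
           if m.get("usageType") == "target"]

print(fields)
print(targets)
```

Because the model is plain XML, any consumer that understands the vocabulary can read the schema and the tree without access to the system that trained it, which is the interoperability argument the slide makes.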
    • 16. The Data Mining Software Vendors Market Competition The Data Mining World According to
    • 17. Disclaimer: This presentation contains preliminary information that may be changed substantially prior to final commercial release of the software described herein. The information contained in this presentation represents the current view of Microsoft Corporation on the issues discussed as of the date of the presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of the presentation. This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION. Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this presentation. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this information does not give you any license to these patents, trademarks, copyrights, or other intellectual property. © 2005 Microsoft Corporation. All rights reserved.
    • 18. Major Data Mining Vendors
      • Platforms
        • IBM
        • Oracle
        • SAS
      • Tools
        • SPSS
        • Angoss
        • KXEN
        • Megaputer
        • Fair Isaac
        • Insightful
    • 19. Competition
      • IBM
        • Product: DB2 Intelligent Miner, WebSphere
        • API: SQL MM/6 based on UDF, SQL SPROC
        • Algorithms: 6
        • Text Mining: Yes
        • Marketing Pages: 10
        • Client Tools: WebSphere Portal (vertical solution), IM Visualization, Excel AddIn
        • Distribution: Additional Packages
        • Target: DB2 IM Scoring module is for developers; other modules are for analysts
        • Strengths: Mature product (6 years). Good service model. Scoring inside relational engine. Strong partnership with SAS.
        • Weaknesses: High price. Standard functionality. Poor API (SQL MM). Confusing product line.
      • SAS
        • Product: Enterprise Miner
        • API: SAS Script
        • Algorithms: 8+
        • Text Mining: Yes
        • Marketing Pages: Dozens
        • Client Tools: None
        • Distribution: Separate Product
        • Target: Analysts
        • Strengths: Mature, market leader. Extensive customization and modelling abilities. Robust, industry tested and accepted algorithms and methodologies. Export to DB2 Scoring.
        • Weaknesses: Expensive. Proprietary. Customer relations range from congenial to hostile.
      • Oracle 10g
        • Product: Oracle Data Mining
        • API: Java DM, PL/SQL
        • Algorithms: 8
        • Text Mining: Yes
        • Marketing Pages: 18
        • Client Tools: Analysis tools, Web-based targeted reports, Discoverer
        • Distribution: Additional Package
        • Target: Developers
        • Strengths: Good credibility with enterprise customers. New GUI. Leader of JDM API. CRM integration.
        • Weaknesses: API overly complex. Inconsistent.
      • SQL Server 2005
        • Product: SQL Server Analysis Services
        • API: OLEDB/DM, DMX, XMLA, ADOMD.Net
        • Algorithms: 7 (+2)
        • Text Mining: Yes
        • Marketing Pages: N/A
        • Client Tools: Embeddable Viewers, Reporting Services
        • Distribution: Included
        • Target: Developers
        • Strengths: Powerful yet simple API. Integration with other BI technologies. New GUI.
        • Weaknesses: Not in-process with relational engine. Lacking statistical functions. Poor analyst experience.
      • Link: http://
    • 20. Major DM
      • SAS Institute (Enterprise Miner)
      • IBM (DB2 Intelligent Miner for Data)
      • Oracle (ODM option to Oracle 10g)
      • SPSS (Clementine)
      • Unica Technologies, Inc. (Pattern Recognition Workbench)
      • Insightful (Insightful Miner)
      • KXEN (Analytic Framework)
      • Prudsys (Discoverer and its family)
      • Microsoft (SQL Server 2005)
      • Angoss (KnowledgeServer and its family)
      • DBMiner (DBMiner)
      • etc…
    • 21. ORACLE
      • Strengths
        • Oracle Data Mining (ODM) Integrated into relational engine
          • Performance benefits
          • Management integration
          • SQL Language integration
        • ODM Client
          • “Walks through” the Data Mining process
          • Data preparation tailored to Data Mining
          • Generates code
        • Integration into Oracle CRM
          • “EZ” Data Mining for customer churn, other applications
        • Full suite of algorithms
          • Typical algorithms, plus text mining and bioinformatics
        • Nice marketing/user education
    • 22. ORACLE
      • Weaknesses
        • Additional Licensing Fees (base $400/user, $20K proc)
        • Confusing API Story
          • Certain features only work with Java API
          • Certain features only work with PL/SQL API
          • Same features work differently with different APIs
        • Difficult to use
          • Different modeling concepts for each algorithm
        • Poor connectivity – ORACLE only
    • 23. SAS
      • Entrenched Data Mining Leader
        • Market Share
        • Mind Share
      • “Best of Breed”
        • Always will attract the top ?% of customers
      • Overall poor product
        • Only for the expert user (SAS Philosophy)
        • Integration of results generally involves source code
      • Integrated with ETL, other SAS tools
      • Partnership with IBM
        • Model in SAS, deploy in DB2
    • 24. Our View ...
        • Progress toward high-level data models and integration with SQL, but:
        • Closed systems,
        • Lacking in coverage and user-extensibility.
        • Not as popular as dedicated, stand-alone DM systems, such as Weka.
    • 25. Weka
      • A comprehensive set of DM algorithms and tools.
      • Generic algorithms over arbitrary data sets.
        • Independent of the number of columns in tables.
      • Open and extensible system based on Java.
      • These are the features that we want in our Inductive DSMS, starting from SQL rather than Java!
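To make "generic algorithms over arbitrary data sets" concrete, here is a minimal Python sketch of a Naive Bayes classifier that never names a column: each row is first unfolded into (column, value, label) triples, so the same code trains on tables of any width. This is one illustrative way to obtain column independence, not Weka's implementation; the function names and sample tables are hypothetical, and Laplace smoothing is an added assumption.

```python
import math
from collections import Counter

def verticalize(rows, target):
    """Unfold each row into (column, value, label) triples, skipping the target."""
    for row in rows:
        for col, val in row.items():
            if col != target:
                yield col, val, row[target]

def train_nb(rows, target):
    """Collect the counts Naive Bayes needs, for any number of columns."""
    class_counts = Counter(row[target] for row in rows)
    pair_counts = Counter(verticalize(rows, target))
    values = {}  # distinct values seen per column, used for smoothing
    for col, val, _ in verticalize(rows, target):
        values.setdefault(col, set()).add(val)
    return class_counts, pair_counts, values

def classify(model, row):
    """Return the label with the highest smoothed log-probability."""
    class_counts, pair_counts, values = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        for col, val in row.items():
            if col in values:
                # Laplace smoothing (an assumption of this sketch).
                count = pair_counts[(col, val, label)]
                score += math.log((count + 1) / (n + len(values[col])))
        if score > best_score:
            best, best_score = label, score
    return best

# Two hypothetical tables with different numbers of columns.
weather = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
]
students = [
    {"Gender": "F", "Encouragement": "yes", "IncomeBand": "high", "CollegePlans": "yes"},
    {"Gender": "M", "Encouragement": "no", "IncomeBand": "low", "CollegePlans": "no"},
    {"Gender": "M", "Encouragement": "yes", "IncomeBand": "high", "CollegePlans": "yes"},
]

weather_pred = classify(train_nb(weather, "play"),
                        {"outlook": "sunny", "windy": "no"})
student_pred = classify(train_nb(students, "CollegePlans"),
                        {"Gender": "F", "Encouragement": "no", "IncomeBand": "low"})
print(weather_pred, student_pred)
```

The same unfolding into triples can be expressed over a relational table, which is one reason a column-independent formulation matters when starting from SQL rather than Java.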
    • 26. References
      • [Imielinski '96] Tomasz Imielinski and Heikki Mannila. A database perspective on knowledge discovery. Commun. ACM, 39(11):58–64, 1996.
      • Carlo Zaniolo. Mining Databases and Data Streams with Query Languages and Rules. Invited talk, Fourth International Workshop on Knowledge Discovery in Inductive Databases (KDID 2005).
    • 27.
      • Thank you!