Applied Enterprise
Semantic Mining
Mark Tabladillo, Ph.D. (MVP, MCAD .NET, MCITP, MCT)
PASS SQL Saturday #198 Vancouver BC
February 16, 2013
Photos © 2013 Mark Tabladillo, All Rights Reserved
Photos © 2013 Mark Tabladillo, All Rights Reserved
Networking
Interactive
About MarkTab
Training and Consulting with        Ph.D. – Industrial Engineering,
http://marktab.com                  Georgia Tech
Data Mining Resources and Blog at   Training and consulting
http://marktab.net                  internationally across many
                                    industries – SAS and Microsoft
                                    Contributed to peer-reviewed
                                    research and legislation
                                      Mentoring doctoral dissertations at the
                                      accredited University of Phoenix
                                    Presenter
Quick Look
My Semantic Search
Interactive
Name three things you want from enterprise text
mining
Introduction
SQL Server 2012 has new Programmability Enhancements
  Statistical Semantic Search
  File Tables
  Full-Text Search Improvements
These combined technologies make SQL Server 2012 a strong contender in text
mining
Outline
Why Microsoft is competitive for data mining
Definitions: what is text mining?
History: how Microsoft’s semantic search was born
What is inside semantic search
 Logical model
 Demos
 Performance
Microsoft Resources
Why Microsoft is
Competitive for Data
Mining
Based on 2012 and 2013 Surveys
Gartner 2013
           Magic Quadrant for
           Business Intelligence
           and Analytics
           Platforms




  Retrieved from http://www.gartner.com/technology/reprints.do?id=1-1DZLPEH&ct=130207&st=sb
  – February 5, 2013
Gartner 2013
           Magic Quadrant for
           Data Warehouse
           Database
           Management
           Systems




  Retrieved from http://www.gartner.com/technology/reprints.do?id=1-1DU2VD4&ct=130131&st=sb
  – January 31, 2013
KDNuggets 2012
http://marktab.net/datamining/2012/06/15/excel-number-
commercial-tool-analytics-data-mining-big-data/
Definitions
What is text mining?
Definition
Data mining is the automated or semi-automated process of
discovering patterns in data
  Text mining is the automated or semi-automated process of
  discovering patterns from textual data
Machine learning is the development and optimization of
algorithms for automated or semi-automated pattern discovery
Purposes
    Phrase          Goal

    “Data Mining”   Inform actionable decisions
    “Text Mining”


    “Machine        Determine best performing
    Learning”       algorithm
MarkTab Decision Cycle
                             GO




           Synthesis                 Analysis
               (art)                (science)


         Science needs science fiction -- MarkTab
MarkTab Decision Cycle
                      GO




          Synthesis        Analysis
            (art)          (science)
History
How Microsoft’s semantic search came to be
History
July 2008
  Microsoft purchases Powerset for US$100 Million
  Google Dismisses Semantic Search
  http://venturebeat.com/2008/06/26/microsoft-to-buy-semantic-search-engine-
  powerset-for-100m-plus/
  http://www.forbes.com/2008/07/01/powerset-msft-search-tech-intel-
  cx_ag_0701powerset.html
History
March 2009
 Google announces “snippets” as relevant to search
 The media picks this story up as “semantic search”
 http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google-
 results.html#!/2009/03/two-new-improvements-to-google-results.html
History
February 2012
  Google announces Knowledge Graph, an explicit application of semantic search
  http://mashable.com/2012/02/13/google-knowledge-graph-change-search/
History
April 2012
  Microsoft purchases 800+ patents from AOL for US$1 Billion
  Among the patents are semantic search and metadata querying – older than
  Google
  http://www.theregister.co.uk/2012/04/09/aol_microsoft_patent_deal/
What is inside Semantic
Search
Text Mining introduced for SQL Server 2012
Future: Most data is Text
Two Research Types
• Quantitative research = data mining
• Qualitative research = text mining
The future is combining both
Statistical Semantic Search
Comprises some aspects of text mining
Identifies statistically relevant key phrases
Based on these phrases, can identify (by score) similar documents
FileTables
Built on existing SQL Server FILESTREAM technology
Files and documents
   Stored in special tables in SQL Server
   Accessed if they were stored in the file system
Full-Text Search Enhancements
Property search: search on tagged properties (such as author or title)
Customizable NEAR: find words or phrases close to one another
New Word Breakers and Stemmers (for many languages)
Logical Model
How semantic search works
From Documents to Output
                    Office
         Varchar
                                 PDF
        NVarchar
                     Rowset
                     Output
                   with Scores
(iFilter Required)
                                  iFilters   Full-Text
       Documents                             Keyword
                                              Index
                                               “FTI”



                                              Semantic
                                             Key Phrase
                                  Semantic     Index –
         Semantic Document        Database    Tag Index
         Similarity Index “DSI”                  “TI”
Languages Currently Supported
Traditional Chinese   Simplified Chinese
German                British English
English               Portuguese
French                Chinese (Hong Kong SAR, PRC)
Italian               Spanish
Brazilian             Chinese (Singapore)
Russian               Chinese (Macau SAR)
Swedish
Phases of Semantic Indexing
      Full Text Keyword Index “FTI”

                                                 Semantic Document Similarity
                                                         Index “DSI”
      Semantic Key Phrase Index –
            Tag Index “TI”




     http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing
Interactive Demo
SQL Server Management Studio
Semantic Search and
SQL Server Data Mining
SQL Server Data Tools: data mining plus text mining
Performance
The Million-Dollar Edge
Integrated Full Text Search (iFTS)
Improved Performance and Scale:
  Scale-up to 350M documents for storage and search
  iFTS query performance 7-10 times faster than in SQL Server 2008
  Worst-case iFTS query response times less than 3 sec for corpus
  Similar or better than main database search competitors
(2012, Michael Rys, Microsoft)
Linear Scale of FTI/TI/DSI
First known linearly scaling end-to-end Search and Semantic product in the industry




            Time in Seconds vs. Number of Documents
            (2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)
Text Mining References
Video
  http://channel9.msdn.com/Shows/DataBound/DataBound-Episode-2-Semantic-
  Search
  http://www.microsoftpdc.com/2009/SVR32
Semantic Search (Books Online) – explains the demo
  http://msdn.microsoft.com/en-us/library/gg492075.aspx
Paper
  http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p213.pdf
Microsoft Resources
Links
Software
SQL Server 2012 Enterprise
(includes database engine, Analysis Services, SSMS and SSDT)
 http://www.microsoft.com/sqlserver/en/us/get-sql-server/try-it.aspx
Microsoft Office 2012 Professional
 http://office.microsoft.com/en-us/try
Organizations
 Professional Association for SQL Server http://www.sqlpass.org
   Atlanta MDF http://www.atlantamdf.com/
   Atlanta Microsoft BI Users Group http://www.meetup.com/Atlanta-Microsoft-
   Business-Intelligence-Users/
PASS Business Analytics Conference http://www.passbaconference.com
Microsoft TechEd North America http://northamerica.msteched.com/
Interactive
Takeaways
Conclusion
SQL Server Data Mining 2012 provides data mining and semantic search
The core technology allows document similarity matching
The results can be combined with SQL Server Data Mining (such as
Association Analysis)
Connect
Data Mining Resources and blog http://marktab.net
Data Mining Training and Consulting (especially Microsoft and SAS)
http://marktab.com

Applied Semantic Search with Microsoft SQL Server

  • 1.
    Applied Enterprise Semantic Mining MarkTabladillo, Ph.D. (MVP, MCAD .NET, MCITP, MCT) PASS SQL Saturday #198 Vancouver BC February 16, 2013
  • 2.
    Photos © 2013Mark Tabladillo, All Rights Reserved
  • 3.
    Photos © 2013Mark Tabladillo, All Rights Reserved
  • 4.
  • 5.
    About MarkTab Training andConsulting with Ph.D. – Industrial Engineering, http://marktab.com Georgia Tech Data Mining Resources and Blog at Training and consulting http://marktab.net internationally across many industries – SAS and Microsoft Contributed to peer-reviewed research and legislation Mentoring doctoral dissertations at the accredited University of Phoenix Presenter
  • 6.
  • 7.
    Interactive Name three thingsyou want from enterprise text mining
  • 8.
    Introduction SQL Server 2012has new Programmability Enhancements Statistical Semantic Search File Tables Full-Text Search Improvements These combined technologies make SQL Server 2012 a strong contender in text mining
  • 9.
    Outline Why Microsoft iscompetitive for data mining Definitions: what is text mining? History: how Microsoft’s semantic search was born What is inside semantic search Logical model Demos Performance Microsoft Resources
  • 10.
    Why Microsoft is Competitivefor Data Mining Based on 2012 and 2013 Surveys
  • 11.
    Gartner 2013 Magic Quadrant for Business Intelligence and Analytics Platforms Retrieved from http://www.gartner.com/technology/reprints.do?id=1-1DZLPEH&ct=130207&st=sb – February 5, 2013
  • 12.
    Gartner 2013 Magic Quadrant for Data Warehouse Database Management Systems Retrieved from http://www.gartner.com/technology/reprints.do?id=1-1DU2VD4&ct=130131&st=sb – January 31, 2013
  • 13.
  • 14.
  • 15.
    Definition Data mining isthe automated or semi-automated process of discovering patterns in data Text mining is the automated or semi-automated process of discovering patterns from textual data Machine learning is the development and optimization of algorithms for automated or semi-automated pattern discovery
  • 16.
    Purposes Phrase Goal “Data Mining” Inform actionable decisions “Text Mining” “Machine Determine best performing Learning” algorithm
  • 17.
    MarkTab Decision Cycle GO Synthesis Analysis (art) (science) Science needs science fiction -- MarkTab
  • 18.
    MarkTab Decision Cycle GO Synthesis Analysis (art) (science)
  • 19.
  • 20.
    History July 2008 Microsoft purchases Powerset for US$100 Million Google Dismisses Semantic Search http://venturebeat.com/2008/06/26/microsoft-to-buy-semantic-search-engine- powerset-for-100m-plus/ http://www.forbes.com/2008/07/01/powerset-msft-search-tech-intel- cx_ag_0701powerset.html
  • 21.
    History March 2009 Googleannounces “snippets” as relevant to search The media picks this story up as “semantic search” http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google- results.html#!/2009/03/two-new-improvements-to-google-results.html
  • 22.
    History February 2012 Google announces Knowledge Graph, an explicit application of semantic search http://mashable.com/2012/02/13/google-knowledge-graph-change-search/
  • 23.
    History April 2012 Microsoft purchases 800+ patents from AOL for US$1 Billion Among the patents are semantic search and metadata querying – older than Google http://www.theregister.co.uk/2012/04/09/aol_microsoft_patent_deal/
  • 24.
    What is insideSemantic Search Text Mining introduced for SQL Server 2012
  • 25.
    Future: Most datais Text Two Research Types • Quantitative research = data mining • Qualitative research = text mining The future is combining both
  • 26.
    Statistical Semantic Search Comprisessome aspects of text mining Identifies statistically relevant key phrases Based on these phrases, can identify (by score) similar documents
  • 27.
    FileTables Built on existingSQL Server FILESTREAM technology Files and documents Stored in special tables in SQL Server Accessed if they were stored in the file system
  • 28.
    Full-Text Search Enhancements Propertysearch: search on tagged properties (such as author or title) Customizable NEAR: find words or phrases close to one another New Word Breakers and Stemmers (for many languages)
  • 29.
  • 30.
    From Documents toOutput Office Varchar PDF NVarchar Rowset Output with Scores
  • 31.
    (iFilter Required) iFilters Full-Text Documents Keyword Index “FTI” Semantic Key Phrase Semantic Index – Semantic Document Database Tag Index Similarity Index “DSI” “TI”
  • 32.
    Languages Currently Supported TraditionalChinese Simplified Chinese German British English English Portuguese French Chinese (Hong Kong SAR, PRC) Italian Spanish Brazilian Chinese (Singapore) Russian Chinese (Macau SAR) Swedish
  • 33.
    Phases of SemanticIndexing Full Text Keyword Index “FTI” Semantic Document Similarity Index “DSI” Semantic Key Phrase Index – Tag Index “TI” http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing
  • 34.
    Interactive Demo SQL ServerManagement Studio
  • 35.
    Semantic Search and SQLServer Data Mining SQL Server Data Tools: data mining plus text mining
  • 36.
  • 37.
    Integrated Full TextSearch (iFTS) Improved Performance and Scale: Scale-up to 350M documents for storage and search iFTS query performance 7-10 times faster than in SQL Server 2008 Worst-case iFTS query response times less than 3 sec for corpus Similar or better than main database search competitors (2012, Michael Rys, Microsoft)
  • 38.
    Linear Scale ofFTI/TI/DSI First known linearly scaling end-to-end Search and Semantic product in the industry Time in Seconds vs. Number of Documents (2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)
  • 39.
    Text Mining References Video http://channel9.msdn.com/Shows/DataBound/DataBound-Episode-2-Semantic- Search http://www.microsoftpdc.com/2009/SVR32 Semantic Search (Books Online) – explains the demo http://msdn.microsoft.com/en-us/library/gg492075.aspx Paper http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p213.pdf
  • 40.
  • 41.
    Software SQL Server 2012Enterprise (includes database engine, Analysis Services, SSMS and SSDT) http://www.microsoft.com/sqlserver/en/us/get-sql-server/try-it.aspx Microsoft Office 2012 Professional http://office.microsoft.com/en-us/try
  • 42.
    Organizations Professional Associationfor SQL Server http://www.sqlpass.org Atlanta MDF http://www.atlantamdf.com/ Atlanta Microsoft BI Users Group http://www.meetup.com/Atlanta-Microsoft- Business-Intelligence-Users/ PASS Business Analytics Conference http://www.passbaconference.com Microsoft TechEd North America http://northamerica.msteched.com/
  • 43.
  • 44.
    Conclusion SQL Server DataMining 2012 provides data mining and semantic search The core technology allows document similarity matching The results can be combined with SQL Server Data Mining (such as Association Analysis)
  • 45.
    Connect Data Mining Resourcesand blog http://marktab.net Data Mining Training and Consulting (especially Microsoft and SAS) http://marktab.com