Applied Enterprise Semantic
Mining
Mark Tabladillo Ph.D.
Data Mining Architect
SQL Saturday Atlanta April 14, 2012
About MarkTab

 20+ Years in Atlanta
   Consulting since 1998; Incorporated 2003
   Part-Time Faculty at University of Phoenix
 SAS and Microsoft Expert
   Presenter since 1998 at conferences like TechEd
   and SAS Global Forum
 http://marktab.com       @MarkTabNet
Introduction

 SQL Server 2012 has new Programmability
 Enhancements
   Statistical Semantic Search
   File Tables
   Full-Text Search Improvements
 These combined technologies make SQL
 Server 2012 a strong contender in text mining
PROBLEM STATEMENT
Challenges

  Building and Maintaining Applications with
  relational and non-relational data is hard
    Complex integration
    Duplicated functionality
    Compensation for unavailable services
  80% of all data is not stored in databases!
  Most of it is “unstructured”

(2012, Michael Rys, Microsoft)
MICROSOFT AND GOOGLE
History

 July 2008
   Microsoft purchases Powerset for US$100 Million
   Google Dismisses Semantic Search
   http://venturebeat.com/2008/06/26/microsoft-to-
   buy-semantic-search-engine-powerset-for-100m-
   plus/
   http://www.forbes.com/2008/07/01/powerset-msft-
   search-tech-intel-cx_ag_0701powerset.html
History

 March 2009
   Google announces “snippets” as relevant to
   search
   The media picks this story up as “semantic
   search”
   http://googleblog.blogspot.com/2009/03/two-new-
   improvements-to-google-
   results.html#!/2009/03/two-new-improvements-to-
   google-results.html
History

 February 2012
   Google announces Knowledge Graph, an explicit
   application of semantic search
   http://mashable.com/2012/02/13/google-
   knowledge-graph-change-search/
History

 April 2012
   Microsoft purchases 800+ patents from AOL for
   US$1 Billion
   Among the patents are semantic search and
   metadata querying – older than Google
   http://www.theregister.co.uk/2012/04/09/aol_micr
   osoft_patent_deal/
PURPOSE STATEMENT
(SQL SERVER)
Goals

  Reduce the cost of managing all data
  Simplify the development of applications over
  all data
  Provide management and programming
  services for all data
  Make SQL Server the preferred choice for
  managing Unstructured Data and allow
  building Rich Application Experience on top
(2012, Michael Rys, Microsoft)
http://msdn.microsoft.com/en-us/library/cc645577.aspx

NEW IN SQL SERVER 2012
Statistical Semantic Search

 Identifies statistically relevant key phrases
 Based on these phrases, can identify (by
 score) similar documents
FileTables

 Built on existing SQL Server FILESTREAM
 technology
 Files and documents
   Stored in special tables in SQL Server
   Accessed if they were stored in the file system
Full-Text Search Enhancements

 Property search: search on tagged properties
 (such as author or title)
 Customizable NEAR: find words or phrases
 close to one another
 New Word Breakers and Stemmers (for many
 languages)
HOW DOES SEMANTIC
SEARCH WORK?
From Documents to Output

                Office

     Varchar
                           PDF
     NVarchar
                Rowset
                Output
                 with
                Scores
“Beyond Relational” vs. “Adoption”

 Start with unstructured (meaning non-
 relational) data
 Use Windows technology
   Reading and Writing Files (Win32 API)
   iFilters for reading proprietary formats
 Develop indexed structure from unstructured
 data
(iFilter Required)


                            iFilters   Full-Text
  Documents                            Keyword
                                        Index
                                         “FTI”


                                       Semantic
                                          Key
                                        Phrase
                            Semantic
   Semantic Document                    Index –
                            Database
   Similarity Index “DSI”              Tag Index
                                          “TI”
“iFilter”?

  IFilters are components that allow search
  services to index content of specific file types,
  letting you search for content in those files.
  They are intended for use with Microsoft
  Search Services (Sharepoint, SQL,
  Exchange, Windows Search).
Microsoft Office 2010 Filters Pack

 Legacy Office Filter (97-2003; .doc, .ppt, .xls)
 Metro Office Filter (2007; .docx, .pptx, .xlsx)
 Zip Filter
 OneNote filter
 Visio Filter
 Publisher Filter
 Open Document Format Filter
Adobe PDF iFilter 9 for 64-bit platforms

 Allows PDF search
 Not currently supported for Windows 7
   But I used it anyway ☺
 Add the Bin directory to your path
   Computer (right click), Properties, Advanced
   System Settings, Environment Variables
“Semantic Language Statistics Database”?

 This database contains the statistical
 language models required by semantic
 search.
 A single semantic language statistics
 database contains the language models for
 all the languages that are supported for
 semantic indexing.
Languages Currently Supported
 Traditional Chinese
 German
 English
 French
 Italian
 Brazilian
 Russian
 Swedish
 Simplified Chinese
 British English
 Portuguese
 Chinese (Hong Kong SAR, PRC)
 Spanish
 Chinese (Singapore)
 Chinese (Macau SAR)
PERFORMANCE
Phases of Semantic Indexing

   Full Text Keyword Index
             “FTI”
                                             Semantic Document
                                             Similarity Index “DSI”
 Semantic Key Phrase Index
            –
      Tag Index “TI”




 http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing
Integrated Full Text Search (iFTS)

 Improved Performance and Scale:
   Scale-up to 350M documents for storage and
   search
   iFTS query performance 7-10 times faster than in
   SQL Server 2008
   Worst-case iFTS query response times less than
   3 sec for corpus
   Similar or better than main database search
   competitors
 (2012, Michael Rys, Microsoft)
Linear Scale of FTI/TI/DSI

 First known linearly scaling end-to-end
 Search and Semantic product in the industry




 Time in Seconds vs. Number of Documents
 (2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)
Conclusion

 SQL Server 2012 adds new text processing
 capabilities
 This technology scales linearly
 Microsoft invites millions of documents for
 enterprise-level applications
Network

 MarkTab Consulting
   http://marktab.com
 Blog
   http://marktab.net
 Twitter
   @marktabnet
APPENDIX
References

 Video
   http://channel9.msdn.com/Shows/DataBound/DataBo
   und-Episode-2-Semantic-Search
   http://www.microsoftpdc.com/2009/SVR32
 Semantic Search (Books Online) – explains the
 demo
   http://msdn.microsoft.com/en-
   us/library/gg492075.aspx
 Paper
   http://users.cis.fiu.edu/~lzhen001/activities/KDD2011
   Program/docs/p213.pdf
Demo: My Semantic Search Sample

 http://mysemanticsearch.codeplex.com/
 Requires:
   iFilters
   Semantic Language Statistics Database
   IIS7, IIS6, with Windows Authentication
   .NET 4.0
   Silverlight 4.0
   FILESTREAM (complete)
Demo: T-SQL and Documents

 Naveen Garg
 Requires Adventure Works (from Codeplex)
 http://blogs.msdn.com/b/sqlfts/archive/2011/0
 7/21/introducing-fulltext-statistical-semantic-
 search-in-sql-server-codename-denali-
 release.aspx
Abstract
 SQL Server 2012 debuts a new Semantic Platform
 (commonly known as the applied task, Semantic
 Search). This text mining technology leverages the
 already established Full Text Index, and builds
 semantic indexes in a two-phase process. This
 presentation provides a science description and
 demo for the Enterprise implementation of Tag Index
 and Document Similarity Index. At present (RTM),
 the indexes work for 15 languages. Included are
 strategy tips for how to best leverage the technology
 along with already-existing Microsoft text mining and
 data mining.

Sql Saturday 111 Atlanta applied enterprise semantic mining

  • 1.
    Applied Enterprise Semantic Mining MarkTabladillo Ph.D. Data Mining Architect SQL Saturday Atlanta April 14, 2012
  • 2.
    About MarkTab 20+Years in Atlanta Consulting since 1998; Incorporated 2003 Part-Time Faculty at University of Phoenix SAS and Microsoft Expert Presenter since 1998 at conferences like TechEd and SAS Global Forum http://marktab.com @MarkTabNet
  • 3.
    Introduction SQL Server2012 has new Programmability Enhancements Statistical Semantic Search File Tables Full-Text Search Improvements These combined technologies make SQL Server 2012 a strong contender in text mining
  • 4.
  • 5.
    Challenges Buildingand Maintaining Applications with relational and non-relational data is hard Complex integration Duplicated functionality Compensation for unavailable services 80% of all data is not stored in databases! Most of it is “unstructured” (2012, Michael Rys, Microsoft)
  • 6.
  • 7.
    History July 2008 Microsoft purchases Powerset for US$100 Million Google Dismisses Semantic Search http://venturebeat.com/2008/06/26/microsoft-to- buy-semantic-search-engine-powerset-for-100m- plus/ http://www.forbes.com/2008/07/01/powerset-msft- search-tech-intel-cx_ag_0701powerset.html
  • 8.
    History March 2009 Google announces “snippets” as relevant to search The media picks this story up as “semantic search” http://googleblog.blogspot.com/2009/03/two-new- improvements-to-google- results.html#!/2009/03/two-new-improvements-to- google-results.html
  • 9.
    History February 2012 Google announces Knowledge Graph, an explicit application of semantic search http://mashable.com/2012/02/13/google- knowledge-graph-change-search/
  • 10.
    History April 2012 Microsoft purchases 800+ patents from AOL for US$1 Billion Among the patents are semantic search and metadata querying – older than Google http://www.theregister.co.uk/2012/04/09/aol_micr osoft_patent_deal/
  • 11.
  • 12.
    Goals Reducethe cost of managing all data Simplify the development of applications over all data Provide management and programming services for all data Make SQL Server the preferred choice for managing Unstructured Data and allow building Rich Application Experience on top (2012, Michael Rys, Microsoft)
  • 13.
  • 14.
    Statistical Semantic Search Identifies statistically relevant key phrases Based on these phrases, can identify (by score) similar documents
  • 15.
    FileTables Built onexisting SQL Server FILESTREAM technology Files and documents Stored in special tables in SQL Server Accessed if they were stored in the file system
  • 16.
    Full-Text Search Enhancements Property search: search on tagged properties (such as author or title) Customizable NEAR: find words or phrases close to one another New Word Breakers and Stemmers (for many languages)
  • 17.
  • 18.
    From Documents toOutput Office Varchar PDF NVarchar Rowset Output with Scores
  • 19.
    “Beyond Relational” vs.“Adoption” Start with unstructured (meaning non- relational) data Use Windows technology Reading and Writing Files (Win32 API) iFilters for reading proprietary formats Develop indexed structure from unstructured data
  • 20.
    (iFilter Required) iFilters Full-Text Documents Keyword Index “FTI” Semantic Key Phrase Semantic Semantic Document Index – Database Similarity Index “DSI” Tag Index “TI”
  • 21.
    “iFilter”? IFiltersare components that allow search services to index content of specific file types, letting you search for content in those files. They are intended for use with Microsoft Search Services (Sharepoint, SQL, Exchange, Windows Search).
  • 22.
    Microsoft Office 2010Filters Pack Legacy Office Filter (97-2003; .doc, .ppt, .xls) Metro Office Filter (2007; .docx, .pptx, .xlsx) Zip Filter OneNote filter Visio Filter Publisher Filter Open Document Format Filter
  • 23.
    Adobe PDF iFilter9 for 64-bit platforms Allows PDF search Not currently supported for Windows 7 But I used it anyway ☺ Add the Bin directory to your path Computer (right click), Properties, Advanced System Settings, Environment Variables
  • 24.
    “Semantic Language StatisticsDatabase”? This database contains the statistical language models required by semantic search. A single semantic language statistics database contains the language models for all the languages that are supported for semantic indexing.
  • 25.
    Languages Currently Supported Traditional Chinese German English French Italian Brazilian Russian Swedish Simplified Chinese British English Portuguese Chinese (Hong Kong SAR, PRC) Spanish Chinese (Singapore) Chinese (Macau SAR)
  • 26.
  • 27.
    Phases of SemanticIndexing Full Text Keyword Index “FTI” Semantic Document Similarity Index “DSI” Semantic Key Phrase Index – Tag Index “TI” http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing
  • 28.
    Integrated Full TextSearch (iFTS) Improved Performance and Scale: Scale-up to 350M documents for storage and search iFTS query performance 7-10 times faster than in SQL Server 2008 Worst-case iFTS query response times less than 3 sec for corpus Similar or better than main database search competitors (2012, Michael Rys, Microsoft)
  • 29.
    Linear Scale ofFTI/TI/DSI First known linearly scaling end-to-end Search and Semantic product in the industry Time in Seconds vs. Number of Documents (2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)
  • 30.
    Conclusion SQL Server2012 adds new text processing capabilities This technology scales linearly Microsoft invites millions of documents for enterprise-level applications
  • 31.
    Network MarkTab Consulting http://marktab.com Blog http://marktab.net Twitter @marktabnet
  • 32.
  • 33.
    References Video http://channel9.msdn.com/Shows/DataBound/DataBo und-Episode-2-Semantic-Search http://www.microsoftpdc.com/2009/SVR32 Semantic Search (Books Online) – explains the demo http://msdn.microsoft.com/en- us/library/gg492075.aspx Paper http://users.cis.fiu.edu/~lzhen001/activities/KDD2011 Program/docs/p213.pdf
  • 34.
    Demo: My SemanticSearch Sample http://mysemanticsearch.codeplex.com/ Requires: iFilters Semantic Language Statistics Database IIS7, IIS6, with Windows Authentication .NET 4.0 Silverlight 4.0 FILESTREAM (complete)
  • 35.
    Demo: T-SQL andDocuments Naveen Garg Requires Adventure Works (from Codeplex) http://blogs.msdn.com/b/sqlfts/archive/2011/0 7/21/introducing-fulltext-statistical-semantic- search-in-sql-server-codename-denali- release.aspx
  • 36.
    Abstract SQL Server2012 debuts a new Semantic Platform (commonly known as the applied task, Semantic Search). This text mining technology leverages the already established Full Text Index, and builds semantic indexes in a two-phase process. This presentation provides a science description and demo for the Enterprise implementation of Tag Index and Document Similarity Index. At present (RTM), the indexes work for 15 languages. Included are strategy tips for how to best leverage the technology along with already-existing Microsoft text mining and data mining.