Applied Enterprise Semantic Mining
Mark Tabladillo, Ph.D., Data Mining Architect
SQL Saturday Atlanta, April 14, 2012
SQL Saturday 111 Atlanta: Applied Enterprise Semantic Mining

SQL Server 2012 debuts a new Semantic Platform (commonly known by its applied task, Semantic Search). This text mining technology leverages the already established Full-Text Index and builds semantic indexes in a two-phase process. This presentation provides a scientific description and demo of the Enterprise implementation of the Tag Index and Document Similarity Index. At present (RTM), the indexes work for 15 languages. Included are strategy tips on how best to leverage the technology alongside existing Microsoft text mining and data mining.
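
The two-phase idea in the description above can be illustrated with a toy sketch. This is not SQL Server's algorithm: it assumes plain TF-IDF as a stand-in for key-phrase scoring and cosine similarity as a stand-in for document similarity. The function names `build_tag_index` and `build_similarity_index` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def build_tag_index(docs, top_n=3):
    """Phase 1 (toy): score each term per document with TF-IDF, keep the top_n."""
    tokenized = {key: text.lower().split() for key, text in docs.items()}
    doc_freq = Counter(term for words in tokenized.values() for term in set(words))
    n_docs = len(docs)
    tag_index = {}
    for key, words in tokenized.items():
        counts = Counter(words)
        scores = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Terms that appear in every document score 0 and are dropped.
        ranked = sorted(((s, t) for t, s in scores.items() if s > 0), reverse=True)
        tag_index[key] = {t: s for s, t in ranked[:top_n]}
    return tag_index


def build_similarity_index(tag_index):
    """Phase 2 (toy): cosine similarity between the key-phrase vectors from phase 1."""
    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
            sum(v * v for v in b.values())
        )
        return dot / norm if norm else 0.0

    keys = sorted(tag_index)
    return {
        (x, y): cosine(tag_index[x], tag_index[y])
        for i, x in enumerate(keys)
        for y in keys[i + 1:]
    }


# Hypothetical sample documents, keyed the way a document key would identify rows.
docs = {
    "a": "semantic search index semantic index",
    "b": "semantic index search engine",
    "c": "file table storage file system",
}
tags = build_tag_index(docs)
similarity = build_similarity_index(tags)
```

The second phase reads only the output of the first, which mirrors how the description says the semantic indexes build on what is already indexed. Documents "a" and "b" share scored terms and so receive a positive similarity, while "c" shares none with either.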

  1. Applied Enterprise Semantic Mining
     Mark Tabladillo, Ph.D., Data Mining Architect
     SQL Saturday Atlanta, April 14, 2012
  2. About MarkTab
     • 20+ Years in Atlanta
     • Consulting since 1998; Incorporated 2003
     • Part-Time Faculty at University of Phoenix
     • SAS and Microsoft Expert
     • Presenter since 1998 at conferences like TechEd and SAS Global Forum
     • http://marktab.com
     • @MarkTabNet
  3. Introduction
     • SQL Server 2012 has new Programmability Enhancements
         - Statistical Semantic Search
         - File Tables
         - Full-Text Search Improvements
     • These combined technologies make SQL Server 2012 a strong contender in text mining
  4. PROBLEM STATEMENT
  5. Challenges
     • Building and Maintaining Applications with relational and non-relational data is hard
         - Complex integration
         - Duplicated functionality
         - Compensation for unavailable services
     • 80% of all data is not stored in databases!
     • Most of it is “unstructured”
     (2012, Michael Rys, Microsoft)
  6. MICROSOFT AND GOOGLE
  7. History
     • July 2008
         - Microsoft purchases Powerset for US$100 Million
         - Google Dismisses Semantic Search
         - http://venturebeat.com/2008/06/26/microsoft-to-buy-semantic-search-engine-powerset-for-100m-plus/
         - http://www.forbes.com/2008/07/01/powerset-msft-search-tech-intel-cx_ag_0701powerset.html
  8. History
     • March 2009
         - Google announces “snippets” as relevant to search
         - The media picks this story up as “semantic search”
         - http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google-results.html#!/2009/03/two-new-improvements-to-google-results.html
  9. History
     • February 2012
         - Google announces Knowledge Graph, an explicit application of semantic search
         - http://mashable.com/2012/02/13/google-knowledge-graph-change-search/
  10. History
     • April 2012
         - Microsoft purchases 800+ patents from AOL for US$1 Billion
         - Among the patents are semantic search and metadata querying, older than Google
         - http://www.theregister.co.uk/2012/04/09/aol_microsoft_patent_deal/
  11. PURPOSE STATEMENT (SQL SERVER)
  12. Goals
     • Reduce the cost of managing all data
     • Simplify the development of applications over all data
     • Provide management and programming services for all data
     • Make SQL Server the preferred choice for managing Unstructured Data and allow building Rich Application Experience on top
     (2012, Michael Rys, Microsoft)
  13. NEW IN SQL SERVER 2012
     http://msdn.microsoft.com/en-us/library/cc645577.aspx
  14. Statistical Semantic Search
     • Identifies statistically relevant key phrases
     • Based on these phrases, can identify (by score) similar documents
  15. FileTables
     • Built on existing SQL Server FILESTREAM technology
     • Files and documents
         - Stored in special tables in SQL Server
         - Accessed as if they were stored in the file system
  16. Full-Text Search Enhancements
     • Property search: search on tagged properties (such as author or title)
     • Customizable NEAR: find words or phrases close to one another
     • New Word Breakers and Stemmers (for many languages)
  17. HOW DOES SEMANTIC SEARCH WORK?
  18. From Documents to Output
     [Diagram. Inputs: Office, PDF, Varchar, NVarchar. Output: Rowset Output with Scores]
  19. “Beyond Relational” vs. “Adoption”
     • Start with unstructured (meaning non-relational) data
     • Use Windows technology
         - Reading and Writing Files (Win32 API)
         - iFilters for reading proprietary formats
     • Develop indexed structure from unstructured data
  20. [Diagram. Components: Documents; iFilters (iFilter Required); Full-Text Keyword Index “FTI”; Semantic Database; Semantic Key Phrase Index (Tag Index “TI”); Semantic Document Similarity Index “DSI”]
  21. “iFilter”?
     IFilters are components that allow search services to index content of specific file types, letting you search for content in those files. They are intended for use with Microsoft Search Services (SharePoint, SQL, Exchange, Windows Search).
  22. Microsoft Office 2010 Filters Pack
     • Legacy Office Filter (97-2003; .doc, .ppt, .xls)
     • Metro Office Filter (2007; .docx, .pptx, .xlsx)
     • Zip Filter
     • OneNote filter
     • Visio Filter
     • Publisher Filter
     • Open Document Format Filter
  23. Adobe PDF iFilter 9 for 64-bit platforms
     • Allows PDF search
     • Not currently supported for Windows 7
         - But I used it anyway ☺
     • Add the Bin directory to your path
         - Computer (right click), Properties, Advanced System Settings, Environment Variables
  24. “Semantic Language Statistics Database”?
     This database contains the statistical language models required by semantic search. A single semantic language statistics database contains the language models for all the languages that are supported for semantic indexing.
  25. Languages Currently Supported
     • Traditional Chinese
     • German
     • English
     • French
     • Italian
     • Brazilian
     • Russian
     • Swedish
     • Simplified Chinese
     • British English
     • Portuguese
     • Chinese (Hong Kong SAR, PRC)
     • Spanish
     • Chinese (Singapore)
     • Chinese (Macau SAR)
  26. PERFORMANCE
  27. Phases of Semantic Indexing
     • Full Text Keyword Index “FTI”
     • Semantic Key Phrase Index (Tag Index “TI”)
     • Semantic Document Similarity Index “DSI”
     http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing
  28. Integrated Full Text Search (iFTS)
     • Improved Performance and Scale:
         - Scale-up to 350M documents for storage and search
         - iFTS query performance 7-10 times faster than in SQL Server 2008
         - Worst-case iFTS query response times less than 3 sec for corpus
         - Similar or better than main database search competitors
     (2012, Michael Rys, Microsoft)
  29. Linear Scale of FTI/TI/DSI
     • First known linearly scaling end-to-end Search and Semantic product in the industry
     • [Chart: Time in Seconds vs. Number of Documents]
     (2011, K. Mukerjee, T. Porter, S. Gherman, Microsoft)
  30. Conclusion
     • SQL Server 2012 adds new text processing capabilities
     • This technology scales linearly
     • Microsoft invites enterprise-level applications with millions of documents
  31. Network
     • MarkTab Consulting: http://marktab.com
     • Blog: http://marktab.net
     • Twitter: @marktabnet
  32. APPENDIX
  33. References
     • Video
         - http://channel9.msdn.com/Shows/DataBound/DataBound-Episode-2-Semantic-Search
         - http://www.microsoftpdc.com/2009/SVR32
     • Semantic Search (Books Online), explains the demo
         - http://msdn.microsoft.com/en-us/library/gg492075.aspx
     • Paper
         - http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p213.pdf
  34. Demo: My Semantic Search Sample
     • http://mysemanticsearch.codeplex.com/
     • Requires:
         - iFilters
         - Semantic Language Statistics Database
         - IIS7, IIS6, with Windows Authentication
         - .NET 4.0
         - Silverlight 4.0
         - FILESTREAM (complete)
  35. Demo: T-SQL and Documents
     • Naveen Garg
     • Requires Adventure Works (from Codeplex)
     • http://blogs.msdn.com/b/sqlfts/archive/2011/07/21/introducing-fulltext-statistical-semantic-search-in-sql-server-codename-denali-release.aspx
  36. Abstract
     SQL Server 2012 debuts a new Semantic Platform (commonly known as the applied task, Semantic Search). This text mining technology leverages the already established Full Text Index, and builds semantic indexes in a two-phase process. This presentation provides a science description and demo for the Enterprise implementation of Tag Index and Document Similarity Index. At present (RTM), the indexes work for 15 languages. Included are strategy tips for how to best leverage the technology along with already-existing Microsoft text mining and data mining.
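
The Tag Index and Document Similarity Index named in the abstract are exposed in SQL Server 2012 through the rowset functions SEMANTICKEYPHRASETABLE and SEMANTICSIMILARITYTABLE. As a minimal sketch, the Python helpers below only assemble the T-SQL text; they do not connect to a server. The helper names, the table name `Documents`, the column name `file_stream`, and the parameter name `@source_key` are assumptions for illustration, not taken from the slides.

```python
def quote_ident(name):
    """Bracket-quote a single-part T-SQL identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def key_phrase_query(table, column, top=10):
    """T-SQL text listing the highest-scoring key phrases in the indexed column."""
    return (
        f"SELECT TOP ({int(top)}) document_key, keyphrase, score\n"
        f"FROM SEMANTICKEYPHRASETABLE({quote_ident(table)}, {quote_ident(column)})\n"
        "ORDER BY score DESC;"
    )


def similar_documents_query(table, column, top=10):
    """T-SQL text ranking documents by similarity to the row identified by @source_key."""
    return (
        f"SELECT TOP ({int(top)}) matched_document_key, score\n"
        f"FROM SEMANTICSIMILARITYTABLE({quote_ident(table)}, {quote_ident(column)}, @source_key)\n"
        "ORDER BY score DESC;"
    )


# Hypothetical table and column names; pass unqualified (single-part) names only.
print(key_phrase_query("Documents", "file_stream", top=5))
print(similar_documents_query("Documents", "file_stream", top=5))
```

Both queries return a score column, which matches the "Rowset Output with Scores" shown on slide 18. Running them requires a table with a full-text index that has semantic indexing enabled, plus the prerequisites listed on slide 34.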
