SQL Saturday 111 Atlanta: Applied Enterprise Semantic Mining

SQL Server 2012 debuts a new Semantic Platform (commonly known by its applied task, Semantic Search). This text mining technology leverages the already established Full-Text Index and builds semantic indexes in a two-phase process. This presentation provides a scientific description and demo of the Enterprise implementation of the Tag Index and the Document Similarity Index. At present (RTM), the indexes work for 15 languages. Included are strategy tips on how best to leverage the technology along with existing Microsoft text mining and data mining.

Published in: Technology

Transcript

  • 1. Applied Enterprise Semantic Mining. Mark Tabladillo, Ph.D., Data Mining Architect. SQL Saturday Atlanta, April 14, 2012
  • 2. About MarkTab: 20+ Years in Atlanta; Consulting since 1998; Incorporated 2003; Part-Time Faculty at University of Phoenix; SAS and Microsoft Expert; Presenter since 1998 at conferences like TechEd and SAS Global Forum; http://marktab.com; @MarkTabNet
  • 3. Introduction: SQL Server 2012 has new Programmability Enhancements: Statistical Semantic Search; FileTables; Full-Text Search Improvements. These combined technologies make SQL Server 2012 a strong contender in text mining
  • 4. PROBLEM STATEMENT
  • 5. Challenges: Building and Maintaining Applications with relational and non-relational data is hard: Complex integration; Duplicated functionality; Compensation for unavailable services. 80% of all data is not stored in databases! Most of it is “unstructured” (2012, Michael Rys, Microsoft)
  • 6. MICROSOFT AND GOOGLE
  • 7. History, July 2008: Microsoft purchases Powerset for US$100 Million; Google Dismisses Semantic Search. http://venturebeat.com/2008/06/26/microsoft-to-buy-semantic-search-engine-powerset-for-100m-plus/ http://www.forbes.com/2008/07/01/powerset-msft-search-tech-intel-cx_ag_0701powerset.html
  • 8. History, March 2009: Google announces “snippets” as relevant to search; The media picks this story up as “semantic search”. http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google-results.html#!/2009/03/two-new-improvements-to-google-results.html
  • 9. History, February 2012: Google announces Knowledge Graph, an explicit application of semantic search. http://mashable.com/2012/02/13/google-knowledge-graph-change-search/
  • 10. History, April 2012: Microsoft purchases 800+ patents from AOL for US$1 Billion; Among the patents are semantic search and metadata querying, older than Google. http://www.theregister.co.uk/2012/04/09/aol_microsoft_patent_deal/
  • 11. PURPOSE STATEMENT (SQL SERVER)
  • 12. Goals: Reduce the cost of managing all data; Simplify the development of applications over all data; Provide management and programming services for all data; Make SQL Server the preferred choice for managing Unstructured Data and allow building Rich Application Experience on top (2012, Michael Rys, Microsoft)
  • 13. NEW IN SQL SERVER 2012 http://msdn.microsoft.com/en-us/library/cc645577.aspx
  • 14. Statistical Semantic Search: Identifies statistically relevant key phrases; Based on these phrases, can identify (by score) similar documents
  • 15. FileTables: Built on existing SQL Server FILESTREAM technology; Files and documents are stored in special tables in SQL Server and accessed as if they were stored in the file system
  • 16. Full-Text Search Enhancements: Property search: search on tagged properties (such as author or title); Customizable NEAR: find words or phrases close to one another; New Word Breakers and Stemmers (for many languages)
  • 17. HOW DOES SEMANTIC SEARCH WORK?
  • 18. From Documents to Output: Inputs (Office, PDF, Varchar, NVarchar); Result: Rowset Output with Scores
  • 19. “Beyond Relational” vs. “Adoption”: Start with unstructured (meaning non-relational) data; Use Windows technology: Reading and Writing Files (Win32 API), iFilters for reading proprietary formats; Develop indexed structure from unstructured data
  • 20. Diagram components: Documents; iFilters (iFilter Required); Full-Text Keyword Index “FTI”; Semantic Database; Semantic Key Phrase Index – Tag Index “TI”; Semantic Document Similarity Index “DSI”
  • 21. “iFilter”? IFilters are components that allow search services to index content of specific file types, letting you search for content in those files. They are intended for use with Microsoft Search Services (SharePoint, SQL, Exchange, Windows Search).
  • 22. Microsoft Office 2010 Filters Pack: Legacy Office Filter (97-2003; .doc, .ppt, .xls); Metro Office Filter (2007; .docx, .pptx, .xlsx); Zip Filter; OneNote filter; Visio Filter; Publisher Filter; Open Document Format Filter
  • 23. Adobe PDF iFilter 9 for 64-bit platforms: Allows PDF search; Not currently supported for Windows 7 (but I used it anyway ☺); Add the Bin directory to your path: Computer (right click), Properties, Advanced System Settings, Environment Variables
  • 24. “Semantic Language Statistics Database”? This database contains the statistical language models required by semantic search. A single semantic language statistics database contains the language models for all the languages that are supported for semantic indexing.
  • 25. Languages Currently Supported: Traditional Chinese; German; English; French; Italian; Brazilian; Russian; Swedish; Simplified Chinese; British English; Portuguese; Chinese (Hong Kong SAR, PRC); Spanish; Chinese (Singapore); Chinese (Macau SAR)
  • 26. PERFORMANCE
  • 27. Phases of Semantic Indexing: Full-Text Keyword Index “FTI”; Semantic Document Similarity Index “DSI”; Semantic Key Phrase Index – Tag Index “TI”. http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing
  • 28. Integrated Full Text Search (iFTS): Improved Performance and Scale: Scale-up to 350M documents for storage and search; iFTS query performance 7-10 times faster than in SQL Server 2008; Worst-case iFTS query response times less than 3 sec for corpus; Similar or better than main database search competitors (2012, Michael Rys, Microsoft)
  • 29. Linear Scale of FTI/TI/DSI: First known linearly scaling end-to-end Search and Semantic product in the industry. Chart: Time in Seconds vs. Number of Documents (2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)
  • 30. Conclusion: SQL Server 2012 adds new text processing capabilities; This technology scales linearly; Microsoft invites millions of documents for enterprise-level applications
  • 31. Network: MarkTab Consulting http://marktab.com; Blog http://marktab.net; Twitter @marktabnet
  • 32. APPENDIX
  • 33. References: Video: http://channel9.msdn.com/Shows/DataBound/DataBound-Episode-2-Semantic-Search and http://www.microsoftpdc.com/2009/SVR32; Semantic Search (Books Online) – explains the demo: http://msdn.microsoft.com/en-us/library/gg492075.aspx; Paper: http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p213.pdf
  • 34. Demo: My Semantic Search Sample http://mysemanticsearch.codeplex.com/ Requires: iFilters; Semantic Language Statistics Database; IIS7, IIS6, with Windows Authentication; .NET 4.0; Silverlight 4.0; FILESTREAM (complete)
  • 35. Demo: T-SQL and Documents (Naveen Garg). Requires Adventure Works (from Codeplex). http://blogs.msdn.com/b/sqlfts/archive/2011/07/21/introducing-fulltext-statistical-semantic-search-in-sql-server-codename-denali-release.aspx
  • 36. Abstract: SQL Server 2012 debuts a new Semantic Platform (commonly known by its applied task, Semantic Search). This text mining technology leverages the already established Full-Text Index and builds semantic indexes in a two-phase process. This presentation provides a scientific description and demo of the Enterprise implementation of the Tag Index and the Document Similarity Index. At present (RTM), the indexes work for 15 languages. Included are strategy tips on how best to leverage the technology along with existing Microsoft text mining and data mining.
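
Slides 14 and 27 describe two semantic indexes: a Tag Index of statistically relevant key phrases and a Document Similarity Index that scores similar documents. The Python sketch below illustrates the general idea only. It assumes TF-IDF weighting and cosine similarity as stand-ins, which is not necessarily what SQL Server does internally; the sample documents and every function name here are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical sample corpus; not taken from the presentation's demo.
DOCS = {
    "a.txt": "semantic search builds a tag index of key phrases",
    "b.txt": "the tag index stores key phrases for semantic search",
    "c.txt": "filetables expose files through the windows file system",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_tag_index(docs):
    """Return {doc_id: {term: TF-IDF weight}} for {doc_id: text}.

    Assumption: TF-IDF stands in for the statistical language models
    that the real semantic language statistics database provides.
    """
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # each document counts a term once
    n_docs = len(docs)
    index = {}
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        index[doc_id] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return index


def key_phrases(index, doc_id, top_n=3):
    """Return the top_n (term, weight) pairs for one document."""
    weights = index[doc_id]
    # Sort by descending weight, then alphabetically to break ties.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


def similarity(index, doc_a, doc_b):
    """Cosine similarity between the weighted term vectors of two documents."""
    vec_a, vec_b = index[doc_a], index[doc_b]
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_documents(index, doc_id):
    """Return (other_doc_id, score) pairs, highest score first."""
    scores = [
        (other, similarity(index, doc_id, other))
        for other in index
        if other != doc_id
    ]
    return sorted(scores, key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    tag_index = build_tag_index(DOCS)
    print(key_phrases(tag_index, "a.txt"))
    print(similar_documents(tag_index, "a.txt"))
```

In this sketch the similarity scores are computed from the tag weights, so the similarity step depends on the key-phrase step having run first. In SQL Server 2012 itself these results are exposed through T-SQL rowset functions rather than application code.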