Applied enterprise semantic mining


  1. Mark Tabladillo, Ph.D., Data Mining Scientist, MarkTab Inc.
     Applied Enterprise Semantic Mining
     Text Mining with SQL Server 2012
     Presented at Atlanta Microsoft Business Intelligence Group, January 28, 2013
     ©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
  2. About MarkTab
     http://marktab.com
     http://marktab.net
     @MarkTabNet
  3. Introduction
     SQL Server 2012 has new Programmability Enhancements:
       ◦ Statistical Semantic Search
       ◦ FileTables
       ◦ Full-Text Search Improvements
     These combined technologies make SQL Server 2012 a strong contender in text mining.
  4. Challenges
     Building and maintaining applications with relational and non-relational data is hard:
       ◦ Complex integration
       ◦ Duplicated functionality
       ◦ Compensation for unavailable services
     80% of all data is not stored in databases! Most of it is “unstructured”.
     (2012, Michael Rys, Microsoft)
  5. Microsoft and Google
  6. History: July 2008
       ◦ Microsoft purchases Powerset for US$100 million
       ◦ Google dismisses semantic search
       ◦ http://venturebeat.com/2008/06/26/microsoft-to-buy-semantic-search-engine-powerset-for-100m-plus/
       ◦ http://www.forbes.com/2008/07/01/powerset-msft-search-tech-intel-cx_ag_0701powerset.html
  7. History: March 2009
       ◦ Google announces “snippets” as relevant to search
       ◦ The media picks this story up as “semantic search”
       ◦ http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google-results.html#!/2009/03/two-new-improvements-to-google-results.html
  8. History: February 2012
       ◦ Google announces Knowledge Graph, an explicit application of semantic search
       ◦ http://mashable.com/2012/02/13/google-knowledge-graph-change-search/
  9. History: April 2012
       ◦ Microsoft purchases 800+ patents from AOL for US$1 billion
       ◦ Among the patents are semantic search and metadata querying (older than Google)
       ◦ http://www.theregister.co.uk/2012/04/09/aol_microsoft_patent_deal/
  10. New in SQL Server 2012
      http://msdn.microsoft.com/en-us/library/cc645577.aspx
  11. Goals of Semantic Search
      Reduce the cost of managing all data
      Simplify the development of applications over all data
      Provide management and programming services for all data
      Make SQL Server the preferred choice for managing unstructured data and allow building rich application experiences on top
      (2012, Michael Rys, Microsoft)
  12. Statistical Semantic Search
      Identifies statistically relevant key phrases
      Based on these phrases, can identify (by score) similar documents
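
The two ideas on the slide above (scoring key phrases, then ranking documents by similarity score) can be illustrated with a small sketch. This is not SQL Server's algorithm, which the deck does not describe; it swaps in generic TF-IDF weighting and cosine similarity over single words. The corpus `DOCS` and every function name here are hypothetical and invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical three-document corpus, invented for illustration only.
DOCS = {
    "doc1": "full text search indexes words in documents for keyword search",
    "doc2": "semantic search scores key phrases and finds similar documents",
    "doc3": "the quarterly budget report lists travel and hardware costs",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return {doc_id: {term: tf-idf weight}} for a dict of raw texts."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for c in counts.values() for term in c)
    vectors = {}
    for doc_id, c in counts.items():
        total = sum(c.values())
        vectors[doc_id] = {
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in c.items()
        }
    return vectors


def key_terms(vector, top_n=3):
    """Return the top_n highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(doc_id, vectors):
    """Rank all other documents by similarity to doc_id, best first."""
    scores = [
        (other, cosine(vectors[doc_id], vec))
        for other, vec in vectors.items()
        if other != doc_id
    ]
    return sorted(scores, key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    vectors = tfidf_vectors(DOCS)
    for doc_id in DOCS:
        print(doc_id, key_terms(vectors[doc_id]), most_similar(doc_id, vectors))
```

Terms shared by several documents receive lower weights, so the terms that distinguish a document rise to the top, and documents that share distinguishing terms score as similar.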
  13. FileTables
      Built on existing SQL Server FILESTREAM technology
      Files and documents:
        ◦ Stored in special tables in SQL Server
        ◦ Accessed as if they were stored in the file system
  14. Full-Text Search Enhancements
      Property search: search on tagged properties (such as author or title)
      Customizable NEAR: find words or phrases close to one another
      New Word Breakers and Stemmers (for many languages)
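
The idea behind a customizable NEAR (match only when two terms fall within a chosen distance of each other) can be sketched as follows. This is a toy illustration, not SQL Server's implementation: the function `near`, its parameters, and the definition of distance as "number of tokens between the two terms" are assumptions made for this example, and it handles single words only.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def near(text, first, second, max_distance, ordered=False):
    """Return True if both terms occur with at most max_distance tokens
    between them. With ordered=True, `first` must come before `second`."""
    tokens = tokenize(text)
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    for i in first_positions:
        for j in second_positions:
            if i == j:
                continue  # the same token cannot match both terms
            if ordered and j < i:
                continue  # wrong order when order matters
            if abs(j - i) - 1 <= max_distance:
                return True
    return False


if __name__ == "__main__":
    sample = "SQL Server 2012 adds semantic search to full-text search"
    print(near(sample, "semantic", "text", 3))
    print(near(sample, "semantic", "text", 2))
```

Making the distance and the ordering requirement parameters is what "customizable" refers to in this sketch: the same pair of terms can match or fail depending on how tightly the caller constrains them.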
  15. From Documents to Output
      [Diagram: Office, PDF, Varchar, NVarchar → Rowset Output with Scores]
  16. “Beyond Relational” vs. “Adoption”
      Start with unstructured (meaning non-relational) data
      Use Windows technology:
        ◦ Reading and writing files (Win32 API)
        ◦ iFilters for reading proprietary formats
      Develop indexed structure from unstructured data
  17. [Diagram: Documents (iFilter required); iFilters; Full-Text Keyword Index “FTI”; Semantic Database; Semantic Key Phrase Index – Tag Index “TI”; Semantic Document Similarity Index “DSI”]
  18. “iFilter”?
      IFilters are components that allow search services to index content of specific file types, letting you search for content in those files.
      They are intended for use with Microsoft Search Services (SharePoint, SQL, Exchange, Windows Search).
  19. Microsoft Office 2010 Filters Pack
      Legacy Office Filter (97-2003; .doc, .ppt, .xls)
      Metro Office Filter (2007; .docx, .pptx, .xlsx)
      Zip Filter
      OneNote Filter
      Visio Filter
      Publisher Filter
      Open Document Format Filter
  20. Adobe PDF iFilter 9 for 64-bit platforms
      Allows PDF search
      Not currently supported for Windows 7 or 8
        ◦ But I used it anyway
      Add the Bin directory to your path
        ◦ Computer (right click), Properties, Advanced System Settings, Environment Variables
  21. “Semantic Language Statistics Database”?
      This database contains the statistical language models required by semantic search.
      A single semantic language statistics database contains the language models for all the languages that are supported for semantic indexing.
  22. Languages Currently Supported
      Traditional Chinese
      German
      English
      French
      Italian
      Brazilian
      Russian
      Swedish
      Simplified Chinese
      British English
      Portuguese
      Chinese (Hong Kong SAR, PRC)
      Spanish
      Chinese (Singapore)
      Chinese (Macau SAR)
  23. Phases of Semantic Indexing
      [Diagram: Full Text Keyword Index “FTI”; Semantic Document Similarity Index “DSI”; Semantic Key Phrase Index – Tag Index “TI”]
      http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing
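
The first index named on the slide above is a keyword index. The general shape of such an index can be sketched as an inverted index that maps each term to the documents and token positions where it occurs. This is a generic textbook structure offered as an assumption-laden illustration, not the internal layout of SQL Server's full-text index; `build_keyword_index`, `lookup`, and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def build_keyword_index(docs):
    """Map each term to {doc_id: [token positions]} (a toy inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


def lookup(index, term):
    """Return the sorted ids of documents that contain the term."""
    return sorted(index.get(term.lower(), {}))


if __name__ == "__main__":
    # Hypothetical documents, invented for illustration only.
    sample_docs = {
        "a": "Semantic search in SQL Server",
        "b": "Full-text search",
    }
    keyword_index = build_keyword_index(sample_docs)
    print(lookup(keyword_index, "search"))
```

Because the keyword index already records which terms occur where, later stages such as key phrase scoring and document similarity can work from it instead of rereading the original files.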
  24. Performance
  25. Integrated Full Text Search (iFTS)
      Improved Performance and Scale:
        ◦ Scale-up to 350M documents for storage and search
        ◦ iFTS query performance 7-10 times faster than in SQL Server 2008
        ◦ Worst-case iFTS query response times less than 3 sec for corpus
        ◦ Similar or better than main database search competitors
      (2012, Michael Rys, Microsoft)
  26. Linear Scale of FTI/TI/DSI
      First known linearly scaling end-to-end Search and Semantic product in the industry
      [Chart: Time in Seconds vs. Number of Documents]
      (2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)
  27. Conclusion
      SQL Server 2012 adds new text processing capabilities
      This technology scales linearly
      Microsoft invites millions of documents for enterprise-level applications
  28. Network
      MarkTab Consulting
        ◦ http://marktab.com
      Blog
        ◦ http://marktab.net
      Twitter
        ◦ @marktabnet
  29. Appendix
  30. References
      Video
        ◦ http://channel9.msdn.com/Shows/DataBound/DataBound-Episode-2-Semantic-Search
        ◦ http://www.microsoftpdc.com/2009/SVR32
      Semantic Search (Books Online) – explains the demo
        ◦ http://msdn.microsoft.com/en-us/library/gg492075.aspx
      Paper
        ◦ http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p213.pdf
  31. Demo: My Semantic Search Sample
      http://mysemanticsearch.codeplex.com/
      Requires:
        ◦ iFilters
        ◦ Semantic Language Statistics Database
        ◦ IIS7, IIS6, with Windows Authentication
        ◦ .NET 4.0
        ◦ Silverlight 4.0
        ◦ FILESTREAM (complete)
  32. Demo: T-SQL and Documents
      Naveen Garg
      Requires Adventure Works (from Codeplex)
      http://blogs.msdn.com/b/sqlfts/archive/2011/07/21/introducing-fulltext-statistical-semantic-search-in-sql-server-codename-denali-release.aspx
  33. Abstract
      SQL Server 2012 debuts a new Semantic Platform (commonly known as the Semantic Search applied task). This text mining technology leverages the already established Full Text Index and builds semantic indexes in a two-phase process. This session's detailed description and demo give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demo is a web-based Silverlight application showing how to interactively use semantic search. Currently, the indexes work for 15 languages. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft text and data mining functionality.
