Applied enterprise semantic mining

Transcript

  • 1. Mark Tabladillo, Ph.D., Data Mining Scientist, MarkTab Inc.
    Applied Enterprise Semantic Mining
    TEXT MINING WITH SQL SERVER 2012
    PRESENTED AT ATLANTA MICROSOFT BUSINESS INTELLIGENCE GROUP, JANUARY 28, 2013
    ©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
  • 2. About MarkTab
    http://marktab.com
    http://marktab.net
    @MarkTabNet
  • 3. Introduction
    SQL Server 2012 has new Programmability Enhancements
      ◦ Statistical Semantic Search
      ◦ File Tables
      ◦ Full-Text Search Improvements
    These combined technologies make SQL Server 2012 a strong contender in text mining
  • 4. Challenges
    Building and maintaining applications with relational and non-relational data is hard
      ◦ Complex integration
      ◦ Duplicated functionality
      ◦ Compensation for unavailable services
    80% of all data is not stored in databases!
    Most of it is “unstructured”
    (2012, Michael Rys, Microsoft)
  • 5. Microsoft and Google
  • 6. History
    July 2008
      ◦ Microsoft purchases Powerset for US$100 Million
      ◦ Google Dismisses Semantic Search
      ◦ http://venturebeat.com/2008/06/26/microsoft-to-buy-semantic-search-engine-powerset-for-100m-plus/
      ◦ http://www.forbes.com/2008/07/01/powerset-msft-search-tech-intel-cx_ag_0701powerset.html
  • 7. History
    March 2009
      ◦ Google announces “snippets” as relevant to search
      ◦ The media picks this story up as “semantic search”
      ◦ http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google-results.html#!/2009/03/two-new-improvements-to-google-results.html
  • 8. History
    February 2012
      ◦ Google announces Knowledge Graph, an explicit application of semantic search
      ◦ http://mashable.com/2012/02/13/google-knowledge-graph-change-search/
  • 9. History
    April 2012
      ◦ Microsoft purchases 800+ patents from AOL for US$1 Billion
      ◦ Among the patents are semantic search and metadata querying – older than Google
      ◦ http://www.theregister.co.uk/2012/04/09/aol_microsoft_patent_deal/
  • 10. New in SQL Server 2012
    http://msdn.microsoft.com/en-us/library/cc645577.aspx
  • 11. Goals of Semantic Search
    Reduce the cost of managing all data
    Simplify the development of applications over all data
    Provide management and programming services for all data
    Make SQL Server the preferred choice for managing Unstructured Data and allow building Rich Application Experience on top
    (2012, Michael Rys, Microsoft)
  • 12. Statistical Semantic Search
    Identifies statistically relevant key phrases
    Based on these phrases, can identify (by score) similar documents
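The two steps on slide 12 (weight the key terms of each document, then score other documents by similarity) can be illustrated with a small Python sketch. The slides do not describe the statistical model SQL Server uses, so this substitutes plain TF-IDF weights and cosine similarity as a stand-in; the function names and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using plain TF-IDF.
    Assumption: TF-IDF stands in for SQL Server's own language models."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def key_terms(vector, top_n=3):
    """Highest-weighted terms first; ties are broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(vectors, source_index):
    """Other documents as (index, score) pairs, best match first."""
    scores = [
        (i, cosine(vectors[source_index], v))
        for i, v in enumerate(vectors)
        if i != source_index
    ]
    return sorted(scores, key=lambda s: -s[1])


# Hypothetical sample documents for illustration only.
docs = [
    "sql server stores invoices and orders",
    "semantic search ranks similar documents",
    "semantic search extracts key phrases from documents",
]
vectors = tfidf_vectors(docs)
print(key_terms(vectors[0]))
print(most_similar(vectors, 1))
```

Returning rows of (document, score) mirrors the "by score" wording on the slide: the caller decides how many matches to keep.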
  • 13. FileTables
    Built on existing SQL Server FILESTREAM technology
    Files and documents
      ◦ Stored in special tables in SQL Server
      ◦ Accessed as if they were stored in the file system
  • 14. Full-Text Search Enhancements
    Property search: search on tagged properties (such as author or title)
    Customizable NEAR: find words or phrases close to one another
    New Word Breakers and Stemmers (for many languages)
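The "customizable NEAR" idea on slide 14 is a proximity test with a tunable distance and an optional ordering requirement. The Python sketch below is a toy illustration of that concept, not SQL Server's implementation. The function name `near`, the way distance is counted (number of other tokens between the two terms), and the sample sentence are all assumptions made for this example.

```python
def near(text, first, second, max_distance, match_order=False):
    """Return True if `first` and `second` occur with at most
    `max_distance` other tokens between them.
    Assumption: distance counts the tokens strictly between the two terms.
    With match_order=True, `first` must appear before `second`."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    for i in first_positions:
        for j in second_positions:
            if i == j:
                continue  # same token cannot be near itself
            if match_order and j < i:
                continue  # wrong order when ordering is required
            if abs(i - j) - 1 <= max_distance:
                return True
    return False


# Hypothetical sample sentence for illustration only.
sentence = "Semantic search builds on the full-text index in SQL Server"
print(near(sentence, "semantic", "index", 5))
print(near(sentence, "index", "semantic", 5, match_order=True))
```

Making the distance and the ordering explicit parameters is what "customizable" adds over a fixed notion of closeness: the same pair of terms can match or not depending on how strict the caller is.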
  • 15. From Documents to Output
    [Diagram labels: Office, PDF, Varchar, NVarchar; Rowset Output with Scores]
  • 16. “Beyond Relational” vs. “Adoption”
    Start with unstructured (meaning non-relational) data
    Use Windows technology
      ◦ Reading and Writing Files (Win32 API)
      ◦ iFilters for reading proprietary formats
    Develop indexed structure from unstructured data
  • 17. [Diagram labels: Documents; iFilters (iFilter Required); Full-Text Keyword Index “FTI”; Semantic Key Phrase Index – Tag Index “TI”; Semantic Document Similarity Index “DSI”; Semantic Database]
  • 18. “iFilter”?
    IFilters are components that allow search services to index content of specific file types, letting you search for content in those files.
    They are intended for use with Microsoft Search Services (SharePoint, SQL, Exchange, Windows Search).
  • 19. Microsoft Office 2010 Filters Pack
    Legacy Office Filter (97-2003; .doc, .ppt, .xls)
    Metro Office Filter (2007; .docx, .pptx, .xlsx)
    Zip Filter
    OneNote Filter
    Visio Filter
    Publisher Filter
    Open Document Format Filter
  • 20. Adobe PDF iFilter 9 for 64-bit platforms
    Allows PDF search
    Not currently supported for Windows 7 or 8
      ◦ But I used it anyway
    Add the Bin directory to your path
      ◦ Computer (right click), Properties, Advanced System Settings, Environment Variables
  • 21. “Semantic Language Statistics Database”?
    This database contains the statistical language models required by semantic search.
    A single semantic language statistics database contains the language models for all the languages that are supported for semantic indexing.
  • 22. Languages Currently Supported
    Traditional Chinese
    German
    English
    French
    Italian
    Brazilian
    Russian
    Swedish
    Simplified Chinese
    British English
    Portuguese
    Chinese (Hong Kong SAR, PRC)
    Spanish
    Chinese (Singapore)
    Chinese (Macau SAR)
  • 23. Phases of Semantic Indexing
    [Diagram labels: Full Text Keyword Index “FTI”; Semantic Document Similarity Index “DSI”; Semantic Key Phrase Index – Tag Index “TI”]
    http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing
  • 24. Performance
  • 25. Integrated Full Text Search (iFTS)
    Improved Performance and Scale:
      ◦ Scale-up to 350M documents for storage and search
      ◦ iFTS query performance 7-10 times faster than in SQL Server 2008
      ◦ Worst-case iFTS query response times less than 3 sec for corpus
      ◦ Similar or better than main database search competitors
    (2012, Michael Rys, Microsoft)
  • 26. Linear Scale of FTI/TI/DSI
    First known linearly scaling end-to-end Search and Semantic product in the industry
    [Chart: Time in Seconds vs. Number of Documents]
    (2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)
  • 27. Conclusion
    SQL Server 2012 adds new text processing capabilities
    This technology scales linearly
    Microsoft invites millions of documents for enterprise-level applications
  • 28. Network
    MarkTab Consulting
      ◦ http://marktab.com
    Blog
      ◦ http://marktab.net
    Twitter
      ◦ @marktabnet
  • 29. Appendix
  • 30. References
    Video
      ◦ http://channel9.msdn.com/Shows/DataBound/DataBound-Episode-2-Semantic-Search
      ◦ http://www.microsoftpdc.com/2009/SVR32
    Semantic Search (Books Online) – explains the demo
      ◦ http://msdn.microsoft.com/en-us/library/gg492075.aspx
    Paper
      ◦ http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p213.pdf
  • 31. Demo: My Semantic Search Sample
    http://mysemanticsearch.codeplex.com/
    Requires:
      ◦ iFilters
      ◦ Semantic Language Statistics Database
      ◦ IIS7, IIS6, with Windows Authentication
      ◦ .NET 4.0
      ◦ Silverlight 4.0
      ◦ FILESTREAM (complete)
  • 32. Demo: T-SQL and Documents
    Naveen Garg
    Requires Adventure Works (from Codeplex)
    http://blogs.msdn.com/b/sqlfts/archive/2011/07/21/introducing-fulltext-statistical-semantic-search-in-sql-server-codename-denali-release.aspx
  • 33. Abstract
    SQL Server 2012 debuts a new Semantic Platform (commonly known as the Semantic Search applied task). This text mining technology leverages the already established Full Text Index and builds semantic indexes in a two-phase process. This session's detailed description and demo give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demo is a web-based Silverlight application showing how to interactively use semantic search. Currently, the indexes work for 15 languages. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft text and data mining functionality.