SQL Server 2012 introduces new semantic search capabilities that leverage its existing full text search. Semantic search builds semantic indexes in two phases, first creating a tag index and then a document similarity index. It can currently support indexing documents in 15 languages. The presentation provides details on how semantic search works and its performance, which scales linearly with document collection size. It also demonstrates how semantic search can help reduce the challenges of building applications with both structured and unstructured data.
2. About MarkTab
20+ Years in Atlanta
Consulting since 1998; Incorporated 2003
Part-Time Faculty at University of Phoenix
SAS and Microsoft Expert
Presenter since 1998 at conferences like TechEd
and SAS Global Forum
http://marktab.com @MarkTabNet
3. Introduction
SQL Server 2012 has new Programmability
Enhancements
Statistical Semantic Search
File Tables
Full-Text Search Improvements
These combined technologies make SQL
Server 2012 a strong contender in text mining
5. Challenges
Building and Maintaining Applications with
relational and non-relational data is hard
Complex integration
Duplicated functionality
Compensation for unavailable services
80% of all data is not stored in databases!
Most of it is “unstructured”
(2012, Michael Rys, Microsoft)
7. History
July 2008
Microsoft purchases Powerset for US$100 Million
Google Dismisses Semantic Search
http://venturebeat.com/2008/06/26/microsoft-to-
buy-semantic-search-engine-powerset-for-100m-
plus/
http://www.forbes.com/2008/07/01/powerset-msft-
search-tech-intel-cx_ag_0701powerset.html
8. History
March 2009
Google announces “snippets” as relevant to
search
The media picks this story up as “semantic
search”
http://googleblog.blogspot.com/2009/03/two-new-
improvements-to-google-
results.html#!/2009/03/two-new-improvements-to-
google-results.html
9. History
February 2012
Google announces Knowledge Graph, an explicit
application of semantic search
http://mashable.com/2012/02/13/google-
knowledge-graph-change-search/
10. History
April 2012
Microsoft purchases 800+ patents from AOL for
US$1 Billion
Among the patents are semantic search and
metadata querying – older than Google
http://www.theregister.co.uk/2012/04/09/aol_micr
osoft_patent_deal/
12. Goals
Reduce the cost of managing all data
Simplify the development of applications over
all data
Provide management and programming
services for all data
Make SQL Server the preferred choice for
managing Unstructured Data and allow
building Rich Application Experience on top
(2012, Michael Rys, Microsoft)
14. Statistical Semantic Search
Identifies statistically relevant key phrases
Based on these phrases, can identify (by
score) similar documents
15. FileTables
Built on existing SQL Server FILESTREAM
technology
Files and documents
Stored in special tables in SQL Server
Accessed if they were stored in the file system
16. Full-Text Search Enhancements
Property search: search on tagged properties
(such as author or title)
Customizable NEAR: find words or phrases
close to one another
New Word Breakers and Stemmers (for many
languages)
18. From Documents to Output
Office
Varchar
PDF
NVarchar
Rowset
Output
with
Scores
19. “Beyond Relational” vs. “Adoption”
Start with unstructured (meaning non-
relational) data
Use Windows technology
Reading and Writing Files (Win32 API)
iFilters for reading proprietary formats
Develop indexed structure from unstructured
data
20. (iFilter Required)
iFilters Full-Text
Documents Keyword
Index
“FTI”
Semantic
Key
Phrase
Semantic
Semantic Document Index –
Database
Similarity Index “DSI” Tag Index
“TI”
21. “iFilter”?
IFilters are components that allow search
services to index content of specific file types,
letting you search for content in those files.
They are intended for use with Microsoft
Search Services (Sharepoint, SQL,
Exchange, Windows Search).
22. Microsoft Office 2010 Filters Pack
Legacy Office Filter (97-2003; .doc, .ppt, .xls)
Metro Office Filter (2007; .docx, .pptx, .xlsx)
Zip Filter
OneNote filter
Visio Filter
Publisher Filter
Open Document Format Filter
23. Adobe PDF iFilter 9 for 64-bit platforms
Allows PDF search
Not currently supported for Windows 7
But I used it anyway ☺
Add the Bin directory to your path
Computer (right click), Properties, Advanced
System Settings, Environment Variables
24. “Semantic Language Statistics Database”?
This database contains the statistical
language models required by semantic
search.
A single semantic language statistics
database contains the language models for
all the languages that are supported for
semantic indexing.
25. Languages Currently Supported
Traditional Chinese
German
English
French
Italian
Brazilian
Russian
Swedish
Simplified Chinese
British English
Portuguese
Chinese (Hong Kong SAR, PRC)
Spanish
Chinese (Singapore)
Chinese (Macau SAR)
27. Phases of Semantic Indexing
Full Text Keyword Index
“FTI”
Semantic Document
Similarity Index “DSI”
Semantic Key Phrase Index
–
Tag Index “TI”
http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing
28. Integrated Full Text Search (iFTS)
Improved Performance and Scale:
Scale-up to 350M documents for storage and
search
iFTS query performance 7-10 times faster than in
SQL Server 2008
Worst-case iFTS query response times less than
3 sec for corpus
Similar or better than main database search
competitors
(2012, Michael Rys, Microsoft)
29. Linear Scale of FTI/TI/DSI
First known linearly scaling end-to-end
Search and Semantic product in the industry
Time in Seconds vs. Number of Documents
(2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)
30. Conclusion
SQL Server 2012 adds new text processing
capabilities
This technology scales linearly
Microsoft invites millions of documents for
enterprise-level applications
33. References
Video
http://channel9.msdn.com/Shows/DataBound/DataBo
und-Episode-2-Semantic-Search
http://www.microsoftpdc.com/2009/SVR32
Semantic Search (Books Online) – explains the
demo
http://msdn.microsoft.com/en-
us/library/gg492075.aspx
Paper
http://users.cis.fiu.edu/~lzhen001/activities/KDD2011
Program/docs/p213.pdf
34. Demo: My Semantic Search Sample
http://mysemanticsearch.codeplex.com/
Requires:
iFilters
Semantic Language Statistics Database
IIS7, IIS6, with Windows Authentication
.NET 4.0
Silverlight 4.0
FILESTREAM (complete)
35. Demo: T-SQL and Documents
Naveen Garg
Requires Adventure Works (from Codeplex)
http://blogs.msdn.com/b/sqlfts/archive/2011/0
7/21/introducing-fulltext-statistical-semantic-
search-in-sql-server-codename-denali-
release.aspx
36. Abstract
SQL Server 2012 debuts a new Semantic Platform
(commonly known as the applied task, Semantic
Search). This text mining technology leverages the
already established Full Text Index, and builds
semantic indexes in a two-phase process. This
presentation provides a science description and
demo for the Enterprise implementation of Tag Index
and Document Similarity Index. At present (RTM),
the indexes work for 15 languages. Included are
strategy tips for how to best leverage the technology
along with already-existing Microsoft text mining and
data mining.