Sql Saturday 111 Atlanta applied enterprise semantic mining

Applied Enterprise Semantic
Mining
Mark Tabladillo Ph.D.
Data Mining Architect
SQL Saturday Atlanta April 14, 2012

About MarkTab

20+ Years in Atlanta
Consulting since 1998; Incorporated 2003
Part-Time Faculty at University of Phoenix
SAS and Microsoft Expert
Presenter since 1998 at conferences like TechEd
and SAS Global Forum
http://marktab.com @MarkTabNet

Introduction

SQL Server 2012 has new Programmability
Enhancements
Statistical Semantic Search
File Tables
Full-Text Search Improvements
These combined technologies make SQL
Server 2012 a strong contender in text mining

Challenges

Building and Maintaining Applications with
relational and non-relational data is hard
Complex integration
Duplicated functionality
Compensation for unavailable services
80% of all data is not stored in databases!
Most of it is “unstructured”

(2012, Michael Rys, Microsoft)

History

July 2008
Microsoft purchases Powerset for US$100 Million
Google Dismisses Semantic Search
http://venturebeat.com/2008/06/26/microsoft-to-
buy-semantic-search-engine-powerset-for-100m-
plus/
http://www.forbes.com/2008/07/01/powerset-msft-
search-tech-intel-cx_ag_0701powerset.html

History

March 2009
Google announces “snippets” as relevant to
search
The media picks this story up as “semantic
search”
http://googleblog.blogspot.com/2009/03/two-new-
improvements-to-google-
results.html#!/2009/03/two-new-improvements-to-
google-results.html

History

February 2012
Google announces Knowledge Graph, an explicit
application of semantic search
http://mashable.com/2012/02/13/google-
knowledge-graph-change-search/

History

April 2012
Microsoft purchases 800+ patents from AOL for
US$1 Billion
Among the patents are semantic search and
metadata querying – older than Google
http://www.theregister.co.uk/2012/04/09/aol_micr
osoft_patent_deal/

PURPOSE STATEMENT
(SQL SERVER)

Goals

Reduce the cost of managing all data
Simplify the development of applications over
all data
Provide management and programming
services for all data
Make SQL Server the preferred choice for
managing Unstructured Data and allow
building Rich Application Experience on top

http://msdn.microsoft.com/en-us/library/cc645577.aspx

NEW IN SQL SERVER 2012

Statistical Semantic Search

Identifies statistically relevant key phrases
Based on these phrases, can identify (by
score) similar documents

FileTables

Built on existing SQL Server FILESTREAM
technology
Files and documents
Stored in special tables in SQL Server
Accessed if they were stored in the file system

Full-Text Search Enhancements

Property search: search on tagged properties
(such as author or title)
Customizable NEAR: find words or phrases
close to one another
New Word Breakers and Stemmers (for many
languages)

HOW DOES SEMANTIC
SEARCH WORK?

From Documents to Output

Office

Varchar
PDF
NVarchar
Rowset
Output
with
Scores

“Beyond Relational” vs. “Adoption”

Start with unstructured (meaning non-
relational) data
Use Windows technology
Reading and Writing Files (Win32 API)
iFilters for reading proprietary formats
Develop indexed structure from unstructured
data

(iFilter Required)

iFilters Full-Text
Documents Keyword
Index
“FTI”

Semantic
Key
Phrase
Semantic
Semantic Document Index –
Database
Similarity Index “DSI” Tag Index
“TI”

“iFilter”?

IFilters are components that allow search
services to index content of specific file types,
letting you search for content in those files.
They are intended for use with Microsoft
Search Services (Sharepoint, SQL,
Exchange, Windows Search).

Microsoft Office 2010 Filters Pack

Legacy Office Filter (97-2003; .doc, .ppt, .xls)
Metro Office Filter (2007; .docx, .pptx, .xlsx)
Zip Filter
OneNote filter
Visio Filter
Publisher Filter
Open Document Format Filter

Adobe PDF iFilter 9 for 64-bit platforms

Allows PDF search
Not currently supported for Windows 7
But I used it anyway ☺
Add the Bin directory to your path
Computer (right click), Properties, Advanced
System Settings, Environment Variables

“Semantic Language Statistics Database”?

This database contains the statistical
language models required by semantic
search.
A single semantic language statistics
database contains the language models for
all the languages that are supported for
semantic indexing.

Languages Currently Supported
Traditional Chinese
German
English
French
Italian
Brazilian
Russian
Swedish
Simplified Chinese
British English
Portuguese
Chinese (Hong Kong SAR, PRC)
Spanish
Chinese (Singapore)
Chinese (Macau SAR)

Phases of Semantic Indexing

Full Text Keyword Index
“FTI”
Semantic Document
Similarity Index “DSI”
Semantic Key Phrase Index
–
Tag Index “TI”

http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing

Integrated Full Text Search (iFTS)

Improved Performance and Scale:
Scale-up to 350M documents for storage and
search
iFTS query performance 7-10 times faster than in
SQL Server 2008
Worst-case iFTS query response times less than
3 sec for corpus
Similar or better than main database search
competitors

Linear Scale of FTI/TI/DSI

First known linearly scaling end-to-end
Search and Semantic product in the industry

Time in Seconds vs. Number of Documents
(2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)

Conclusion

SQL Server 2012 adds new text processing
capabilities
This technology scales linearly
Microsoft invites millions of documents for
enterprise-level applications

Network

MarkTab Consulting
http://marktab.com
Blog
http://marktab.net
Twitter
@marktabnet

References

Video
http://channel9.msdn.com/Shows/DataBound/DataBo
und-Episode-2-Semantic-Search
http://www.microsoftpdc.com/2009/SVR32
Semantic Search (Books Online) – explains the
demo
http://msdn.microsoft.com/en-
us/library/gg492075.aspx
Paper
http://users.cis.fiu.edu/~lzhen001/activities/KDD2011
Program/docs/p213.pdf

Demo: My Semantic Search Sample

http://mysemanticsearch.codeplex.com/
Requires:
iFilters
Semantic Language Statistics Database
IIS7, IIS6, with Windows Authentication
.NET 4.0
Silverlight 4.0
FILESTREAM (complete)

Demo: T-SQL and Documents

Naveen Garg
Requires Adventure Works (from Codeplex)
http://blogs.msdn.com/b/sqlfts/archive/2011/0
7/21/introducing-fulltext-statistical-semantic-
search-in-sql-server-codename-denali-
release.aspx

Abstract
SQL Server 2012 debuts a new Semantic Platform
(commonly known as the applied task, Semantic
Search). This text mining technology leverages the
already established Full Text Index, and builds
semantic indexes in a two-phase process. This
presentation provides a science description and
demo for the Enterprise implementation of Tag Index
and Document Similarity Index. At present (RTM),
the indexes work for 15 languages. Included are
strategy tips for how to best leverage the technology
along with already-existing Microsoft text mining and
data mining.

Sql Saturday 111 Atlanta applied enterprise semantic mining

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (14)

Similar to Sql Saturday 111 Atlanta applied enterprise semantic mining

Similar to Sql Saturday 111 Atlanta applied enterprise semantic mining (20)

More from Mark Tabladillo

More from Mark Tabladillo (20)

Recently uploaded

Recently uploaded (20)

Sql Saturday 111 Atlanta applied enterprise semantic mining