FileTable and Semantic Search in SQL Server 2012


SQL Saturday 109 Presentation on FileTable and Semantic Search in SQL Server 2012

  1. 1. FILETABLE AND SEMANTIC SEARCH IN SQL SERVER 2012
     Michael Rys, Principal Program Manager, Microsoft Corp., @SQLServerMike
     © 2012 Microsoft
  2. 2. MY FAVORITE BEYOND RELATIONAL APPLICATION
     • Structured and unstructured Search
     • Related/"Semantic" Search
  3. 3. BEYOND RELATIONAL DATA
     Building and maintaining applications with relational and non-relational data is hard
     Pain Points
     • Complex integration
     • Duplicated functionality
     • Compensation for unavailable services
     Goals
     • Reduce the cost of managing all data
     • Simplify the development of applications over all data
     • Provide management and programming services for all data
  4. 4. RICH UNSTRUCTURED DATA IN SQL SERVER 2012
     • 80% of all data is not stored in databases! Most of it is "unstructured"
     • Make SQL Server the preferred choice for managing unstructured data and allow building rich application experiences on top
     • Address important customer requests for capabilities and rich services for Rich Unstructured Data (RUDS)
         o Scale up for storage and search to 100 million to 500 million documents
         o Easy use of and access to unstructured data from all applications
         o Rich insight into unstructured data to make better decisions
  5. 5. DEMO
     Teaser: MySemanticSearch
  6. 6. RICH UNSTRUCTURED DATA & SERVICES ECOSYSTEM
     [Diagram. Recoverable labels:
     • Applications: Database Applications, Windows Apps, SQL Apps, Customer Application
     • Access paths: Transactional Access and Streaming Win32 Access (Blobs, FILESTREAM API); SMB Share (Files/Folders)
     • Database solutions: FILESTREAM, FileTable, Remote BLOB Storage (SQL RBS API with libraries for Centera, SQL FILESTREAM DB, Azure)
     • Rich Services: Fulltext Search, Semantic Similarity Search
     • Scale-up: Multiple Containers (Disk1, Disk2, Disk3)
     • Integrated Administration: Backup/Replication/AlwaysOn]
  7. 7. DEMO
     Integrated Management of documents in SQL Server 2012
  8. 8. FILETABLE OVERVIEW
     • FileTable: A Table of Files/Directories
         • User created Table with a fixed schema
         • Contains FILESTREAM and File Attributes
         • Each row represents a File or a Directory
         • System defined constraints maintain the tree integrity
     • File/Directory hierarchy view through a Windows Share
         • Supports Win32 APIs for File/Directory Management
         • DB Storage is Transparent to Win32 applications
         • SMB level of application compatibility
         • Virtual network name (VNN) path support for transparent Win32 application failover
     [Diagram: FileTable Folder Hierarchy. FILESTREAM Share (\\my_machine\MSSQLSERVER) > Database Directories (Office Docs, Private Docs; labeled Database1 and Database2) > FileTable Directories (Media, Documents, LogFiles; each a FileTable) > User-Defined Directory Structure]
  9. 9. CREATING A FILETABLE
     Pre-requisites
     • Enable FILESTREAM
     • Create FILESTREAM Share and Filegroup
     • Enable non-transactional access at the DB level
         ALTER DATABASE Contoso SET FILESTREAM (NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'Contoso')
     Create FileTable
         CREATE TABLE Contoso..Documents AS FILETABLE WITH (FILETABLE_DIRECTORY = N'Document Library')
     Access at \\<machine name>\<FILESTREAM share>\Contoso\Document Library
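The prerequisites on this slide can be strung together in one script. The following is a minimal sketch, assuming FILESTREAM has already been enabled for the instance at the Windows service level (SQL Server Configuration Manager). The file paths, logical file names and the filegroup name `ContosoFS` are hypothetical and not from the presentation.

```sql
-- Instance level: allow both T-SQL and Win32 streaming access to FILESTREAM data.
EXEC sp_configure 'filestream access level', 2;
RECONFIGURE;
GO

-- Hypothetical paths and logical names. The parent folder C:\Data must exist;
-- the FILESTREAM container folder C:\Data\ContosoFS must not exist yet.
CREATE DATABASE Contoso
ON PRIMARY
    (NAME = Contoso_data, FILENAME = 'C:\Data\Contoso.mdf'),
FILEGROUP ContosoFS CONTAINS FILESTREAM
    (NAME = Contoso_fs, FILENAME = 'C:\Data\ContosoFS')
LOG ON
    (NAME = Contoso_log, FILENAME = 'C:\Data\Contoso.ldf');
GO

-- Database level: enable non-transactional access and name the database directory.
ALTER DATABASE Contoso
SET FILESTREAM (NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'Contoso');
GO

-- Create the FileTable; its rows appear as files under ...\Contoso\Document Library.
USE Contoso;
GO
CREATE TABLE dbo.Documents AS FILETABLE
WITH (FILETABLE_DIRECTORY = N'Document Library');
GO
```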
  10. 10. MODIFYING A FILETABLE
     FileTable has a fixed schema
     • Columns and system defined constraints cannot be altered/dropped
     • Allows user defined indexes/constraints/triggers
     Disabling/Enabling FileTable Namespace
         ALTER TABLE Documents DISABLE FILETABLE_NAMESPACE
     • Disables all system-defined constraints and Win32 access to FileTable
     • Useful for bulk-loading/re-organization of data
     FileTable can be dropped similar to any other table
     Catalog views can be used for obtaining metadata
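As an illustration of the catalog views and the namespace switch mentioned on this slide, here is a short sketch. It assumes a FileTable named `dbo.Documents` in the current database; the table name is carried over from the previous slide's example.

```sql
-- List the FileTables in the current database and whether their namespace is enabled.
SELECT OBJECT_NAME(object_id) AS filetable_name,
       directory_name,
       is_enabled
FROM sys.filetables;

-- List the system-defined objects (constraints, indexes) that back each FileTable.
SELECT OBJECT_NAME(parent_object_id) AS filetable_name,
       OBJECT_NAME(object_id)        AS system_object_name
FROM sys.filetable_system_defined_objects;

-- Bulk-load window: switch off the namespace constraints and Win32 access,
-- load or reorganize the rows, then switch the namespace back on.
ALTER TABLE dbo.Documents DISABLE FILETABLE_NAMESPACE;
-- ... bulk load or reorganize rows here ...
ALTER TABLE dbo.Documents ENABLE FILETABLE_NAMESPACE;
```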
  11. 11. DATA ACCESS: FILE SYSTEM ACCESS
     FileTable hierarchy is visible through the FILESTREAM share
         \\<machine>\<FILESTREAM share>\<Database_directory>\<FileTable_Directory>\...
     Provides transparent Win32 API and File/Directory Management capabilities
     • e.g. MS Word can create/open/save files; xcopy can copy directory trees into the database
     Win32 API operations are non-transactional
     • Operations cannot be part of any user transactions
     Win32 operations are intercepted by SQL Server at the File system level
     • e.g. File/Directory creation/deletion => insert/delete into FileTable
     • Full locking/concurrency semantics with other accesses
     • Allows in-place update of file stream data/File attributes
     Transactional FILESTREAM APIs can also be used.
  12. 12. DATA ACCESS: T-SQL ACCESS
     Normal Insert/Update/Delete allowed for FileTable manipulation
     • FileTable Namespace integrity constraints enforced
     • Set based operations on the File attributes (value add)
     Built-in functions
     • GetFileNamespacePath(): UNC path for a file/directory
     • FileTableRootPath(): UNC path to the FileTable root
     • GetPathLocator(): path_locator value for a file/directory
     DDL/DML Triggers are supported
     • DML triggers on a FileTable cannot update any FileTables
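The T-SQL access path and the three built-in functions can be sketched as follows. This assumes the `dbo.Documents` FileTable from the earlier slide; the file and directory names (`Reports`, `hello.txt`) are made up for illustration.

```sql
-- Create a directory and a file with ordinary DML; no Win32 call is involved.
INSERT INTO dbo.Documents (name, is_directory)
VALUES (N'Reports', 1);

INSERT INTO dbo.Documents (name, file_stream)
VALUES (N'hello.txt', CAST('Hello, FileTable' AS varbinary(max)));

-- UNC path of the FileTable root.
SELECT FileTableRootPath(N'dbo.Documents') AS root_path;

-- Full UNC path of every file and directory in the table.
SELECT name,
       is_directory,
       file_stream.GetFileNamespacePath(1) AS full_unc_path
FROM dbo.Documents;

-- Set-based operation over file attributes: count and total size per file type.
SELECT file_type,
       COUNT(*)              AS files,
       SUM(cached_file_size) AS total_bytes
FROM dbo.Documents
WHERE is_directory = 0
GROUP BY file_type;

-- Map a UNC path back to its row through path_locator.
DECLARE @path nvarchar(4000) =
    FileTableRootPath(N'dbo.Documents') + N'\hello.txt';

SELECT name, creation_time
FROM dbo.Documents
WHERE path_locator = GetPathLocator(@path);
```

Both inserted rows land in the FileTable root because `path_locator` is left to its default; they are also visible through the FILESTREAM share.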
  13. 13. MANAGING FILETABLE
     DB Backup/Restore operations include FileTable data
     • A point-in-time restore may contain more recent FILESTREAM data due to non-transactional updates during backup
     FileTables are secured similar to any other user tables
     • Same security is enforced for Win32 access also
     Data Loading
     • Windows tools like xcopy/robocopy or drag-and-drop operations through Windows Explorer can be used
     • BCP operations are supported for direct T-SQL data inserts
     SSMS supports FileTable creation/exploration
  14. 14. MANAGING FILETABLE: HIGH AVAILABILITY
     SQL Server 2012 AlwaysOn is fully supported
     Transparent data failover
     • FileTables can be configured with multiple secondary nodes
     • Both sync and async data replication is supported
     • File data and metadata are available on the secondary in case of failover
     Transparent application failover
     • Virtual network name (VNN) path support for transparent Win32 application failover
     • Applications use the \\VNN\Share\db\... path
     • Applications are automatically redirected to the secondary in case of failover
     Restrictions
     • FileTables cannot participate in "Read-only" replicas.
  15. 15. FILETABLE RESTRICTIONS
     • FileTables cannot be partitioned
     • Merge/Transactional replication is not supported
     • RCSI/Snapshot isolation mode
         • Applications cannot modify file stream data in FileTables
     • Win32 Application compatibility
         • Memory mapped files, Directory notifications, links are not supported
  16. 16. UNSTRUCTURED DATA SCALE-UP: MULTIPLE CONTAINERS FOR FILESTREAM DATA
     SQL 2008 R2
     • Only one storage container per FILESTREAM filegroup
     • Limits storage capacity scaling and I/O scaling
     SQL Server 2012
     • Support for multiple storage containers per filegroup
     • DDL changes to Create/Alter Database statements
     • Ability to set max_size for the containers
     • DBCC Shrinkfile Emptyfile support
     Scaling Flexibility
     • Storage scaling by adding additional storage drives
     • I/O scaling with multiple spindles
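The DDL changes listed here might be used as in the sketch below. It assumes a database `Contoso` with an existing FILESTREAM filegroup `ContosoFS` and container `Contoso_fs`; those names, the drive letter and the size cap are all hypothetical.

```sql
-- Add a second container on another drive to the same FILESTREAM filegroup,
-- with a cap on how large that container may grow.
ALTER DATABASE Contoso
ADD FILE
    (NAME = Contoso_fs2, FILENAME = 'D:\Data\ContosoFS2', MAXSIZE = 100 GB)
TO FILEGROUP ContosoFS;
GO

-- Before retiring a container, ask SQL Server to move its data to the
-- other containers in the filegroup.
USE Contoso;
GO
DBCC SHRINKFILE (N'Contoso_fs', EMPTYFILE);
GO
```

With containers placed on separate drives, new FILESTREAM data is spread across them, which is where the storage and I/O scaling on this slide comes from.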
  17. 17. UNSTRUCTURED DATA: MULTIPLE CONTAINERS
     Use of multiple spindles for achieving better I/O Scalability
  18. 18. RUDS SCALE-UP: FILESTREAM PERF/SCALE
     Improved performance of T-SQL and File I/O access
     • Various enhancements to improve read/write throughput
     • 5 fold increase in Read throughput
     • Linear scaling with large number of concurrent threads
     [Two charts, each labeled 2012]
  19. 19. SUMMARY: FILETABLE
     Application Compatibility for Windows Applications
     • Windows applications run on top of files stored in FileTables with no modifications
     Relational Value Proposition
     • Provide Integrated Administration and Services
         • Backup, Log Shipping, HA-DR, Full text and Semantic search, …
     • T-SQL orthogonality
         • File/Folder attributes surfaced through relational columns
         • Power of set based operations, Policy Management, Reporting etc.
     • FileNamespace Hierarchy management
  20. 20. FULL TEXT SEARCH IMPROVEMENTS IN SQL SERVER 2012
     Improved Performance and Scale
     • Scale-up to 350M documents
     • iFTS query perf 7-10 times faster than in SQL Server 2008
     • Worst-case iFTS query response times < 3 sec for corpus
     • On par with or better than main database search competitors
     New Functionality
     • Property Search
     • Customizable NEAR
     • New word breakers: update existing word breakers, add Czech and Greek
     Innovation in Search
     • Semantic Similarity Search
  21. 21. FULLTEXT SEARCH PERFORMANCE & SCALE IMPROVEMENTS
     Architectural Improvements
     • Improved internal implementation
     • Queries no longer block index updates
     Improved Query Plans
     • Better plans for common queries
     • Fulltext predicate folding
     • Parallel plan execution
     Index and Query tested at a scale of up to 350 million documents
     • Response times of about 2 sec or less
     • ~3X better throughput without DML and ~9X better with DML
     • Scales easily with increasing number of connections
  22. 22. SCALE-UP: FULL-TEXT SEARCH 2005/8 vs 2012
     [Chart: 2005/8 vs 2012]
     Queries over a database of 350M documents with random DMLs running in the background. Beats SQL Server 2005 with a scale factor of more than 2x and with 60x better throughput on average.
  23. 23. SCALE-UP: FULL-TEXT SEARCH 2005/8 vs 2012
     [Chart: 2005/8 vs 2012]
     Query avgExecTime (ms) under various numbers of connections (50 ~ 2000 users) for the customer playback benchmark
  24. 24. FULLTEXT PROPERTY SCOPED SEARCH
     New Search Filter for Document Properties
         CONTAINS (PROPERTY ( { column_name }, 'property_name' ), 'contains_search_condition' )
     • Setup once per database instance to load the office filters
         exec sp_fulltext_service 'load_os_resources', 1
         go
         exec sp_fulltext_service 'restart_all_fdhosts'
         go
     • Create a property list
         CREATE SEARCH PROPERTY LIST p1;
     • Add properties to be extracted
         ALTER SEARCH PROPERTY LIST [p1] ADD N'System.Author'
             WITH (PROPERTY_SET_GUID = 'f29f85e0-4ff9-1068-ab91-08002b27b3d9',
                   PROPERTY_INT_ID = 4,
                   PROPERTY_DESCRIPTION = N'System.Author');
     • Create/Alter Fulltext index to specify property list to be extracted
         ALTER FULLTEXT INDEX ON fttable... SET SEARCH PROPERTY LIST = [p1];
     • Query for properties
         SELECT * FROM fttable WHERE CONTAINS(PROPERTY(ftcol, 'System.Author'), 'fernlope');
  25. 25. FULL-TEXT CUSTOMIZABLE NEAR
     OLD NEAR SYNTAX
         select * from fttable where contains(*, 'test near Space')
     NEW NEAR USAGES
     • SPECIFY DISTANCE
         select * from fttable where contains(*, 'near((test, Space), 5, false)')
     • REDUCE DISTANCE
         select * from fttable where contains(*, 'near((test, Space), 2, false)')
     • ORDER OF WORDS IS SPECIFIED AS IMPORTANT
         select * from fttable where contains(*, 'near((test, Space), 5, true)')
  26. 26. STATISTICAL SEMANTIC SEARCH
     Semantic Insight into textual content
     • Uses language models to find the most important keywords in a document
     • No need to build brittle ontologies!
     Statistically Prominent Keywords
     • Autogenerated tag clouds
     Potentially Related Content based on extracted Keywords, such as
     • Similar Products (based on description)
     • Similar Jobs or Applicants
     • Similar Support Incidents (based on call logs)
     • Potential Solutions (based on similar incidents)
     First class usage experience
     • Efficient linear algorithms
     • Integrated with FTS and SQL
     • New Rowset functions for all results using SQL query
  27. 27. DEMO
     Semantic Extraction and Relationships
     FullText Search in SQL Server 2012
  28. 28. SEMANTIC SIMILARITY
     • Input: Text such as varchar, Office, PDF, HTML, email…
     • Output: Rowset functions with standard SQL queries
     Illustrating example (Source Table > Full-Text and Semantic Processing > outputs 1 to 3):

     Source Table
         Key   Title                Document
         D1    Annual Budget        …
         D2    Corporate Earnings   …
         D3    Marketing Reports    …
         …     …                    …

     Keyword Index (Full-Text), e.g. quarter, record, revenue…
         ID   Keyword   Colid   …   compDocid   CompOc               CompPid
         K1   revenue   1       …   10,23,123   (1,4),(5,8),(1,34)   2,5,6,8,4,3
         K2   growth    1       …   10,23,123   (1,5),(5,9),(1,34)   2,5,6,8,5,4
         …    …         …       …   …           …                    …

     Keyphrases
         ID   Keyword
         T1   revenue
         T2   growth
         T3   Windows
         T4   Azure
         …    …

     KeyphraseDocuments
         ID             DocID
         T1 (revenue)   D1 (Annual Budget)
         T2 (growth)    D2 (Corporate Earnings)
         T3 (Windows)   D3 (Marketing Reports)
         …              …
         T1 (revenue)   D7 (Finance Report)
         T3 (Windows)   D11 (Azure Strategy)
         T4 (Azure)     D11 (Azure Strategy)

     DocumentSimilarity
         DocID                    MatchedDocID
         D1 (Annual Budget)       D2 (Corporate Earnings)
         D1 (Annual Budget)       D7 (Finance Report)
         D3 (Marketing Reports)   D11 (Azure Strategy)
         …                        …
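The rowset functions that expose these results can be queried like any other table source. The sketch below is illustrative only: it assumes a hypothetical table `dbo.SourceDocs (DocID int, Title nvarchar, Document ...)` modeled on the slide's source table, with `DocID` as the full-text key and a full-text index on `Document` created with STATISTICAL_SEMANTICS. The document titles are taken from the slide's example.

```sql
-- Pick two documents by title (hypothetical table and column names).
DECLARE @doc int =
    (SELECT DocID FROM dbo.SourceDocs WHERE Title = N'Annual Budget');
DECLARE @other int =
    (SELECT DocID FROM dbo.SourceDocs WHERE Title = N'Corporate Earnings');

-- Key phrases: the statistically most prominent phrases of one document.
SELECT TOP (10) keyphrase, score
FROM SEMANTICKEYPHRASETABLE(dbo.SourceDocs, Document, @doc)
ORDER BY score DESC;

-- Document similarity: documents most similar to the chosen one.
SELECT TOP (10) d.Title, s.score
FROM SEMANTICSIMILARITYTABLE(dbo.SourceDocs, Document, @doc) AS s
JOIN dbo.SourceDocs AS d
    ON d.DocID = s.matched_document_key
ORDER BY s.score DESC;

-- Similarity details: the key phrases that make two documents similar.
SELECT TOP (10) keyphrase, score
FROM SEMANTICSIMILARITYDETAILSTABLE(dbo.SourceDocs, Document, @doc,
                                    Document, @other)
ORDER BY score DESC;
```

These three functions correspond roughly to the Keyphrases/KeyphraseDocuments and DocumentSimilarity outputs in the illustration.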
  29. 29. SEMANTIC EXTRACTION: END-2-END EXPERIENCE
     • Downloadable Language Statistical Database with registration stored procedure
     • Setup along with Full-Text
     • Metadata / Catalog views
     • System level DMVs for progress state and usage
     • Manageability through SSMS and SMO
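A sketch of the setup and monitoring steps named above, assuming the semantic language statistics database has been downloaded and extracted to the hypothetical paths shown; the database name `semanticsdb` is also an assumption.

```sql
-- Attach the downloaded language statistics database (hypothetical file paths).
CREATE DATABASE semanticsdb
ON (FILENAME = 'C:\Data\semanticsdb.mdf'),
   (FILENAME = 'C:\Data\semanticsdb_log.ldf')
FOR ATTACH;
GO

-- Register it with the instance using the registration stored procedure.
EXEC sp_fulltext_semantic_register_language_statistics_db
     @dbname = N'semanticsdb';
GO

-- Metadata: which statistics database is registered, and for which languages.
SELECT * FROM sys.fulltext_semantic_language_statistics_database;
SELECT * FROM sys.fulltext_semantic_languages;

-- Catalog view: which full-text indexed columns have semantic indexing enabled.
SELECT OBJECT_NAME(object_id) AS table_name,
       column_id,
       statistical_semantics
FROM sys.fulltext_index_columns;

-- DMVs: progress of full-text and semantic similarity population.
SELECT * FROM sys.dm_fts_index_population;
SELECT * FROM sys.dm_fts_semantic_similarity_population;
```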
  30. 30. KEY TAKEAWAYS
     SQL Server's unstructured data support is a key strategy to enable you to build complex data applications that go beyond relational data!
     • Content and Collaboration, eDiscovery, Healthcare, Document management etc.
