SQLBits X SQL Server 2012 Rich Unstructured Data

SQLBits X Training Day Presentation on SQL Server 2012 FileStream, FileTable, FullText Search and Semantic Search

Copyright (c) Microsoft Corp.

Published in: Technology

  • SQL Server 2008 provides FILESTREAM as a way to add large blobs and unstructured data streams to SQL Server while still being able to open a Win32 handle (using the SQL API) and get high streaming performance for the data. Win32 namespace support in SQL Server 2012 has the following goals. (1) Reduce the barrier to entry for customers who have data on file servers and Win32 applications that currently work on it. With the Win32 namespace enabled, SQL Server generates a Windows share that can be exposed to existing Win32 applications like any file server share, so Win32 applications and mid-tier servers (such as IIS) can work with this data without having to understand database or transaction semantics. (2) Provide a single integrated set of admin tools: SQL backup/restore, replication, HA solutions, etc. (3) Scale up: add multiple disks on a machine for storing FILESTREAM data. (4) Use SQL services such as full-text search over both FILESTREAM data and relational metadata, plus a property promotion infrastructure for extracting interesting properties from SQL blobs/FILESTREAM data and surfacing them as relational columns for query.
  • The RBS API is exposed in the RBS client library.
  • The blob ID is generated after Close. The app can then store the blob ID in the RBS column.
  • To get the transaction context, you need an open transaction. This is a SQL transaction.
  • We are reading from the SqlFileStream and writing the bytes read into the output buffer.
  • URI: HealthCare.MRI.JoeSmith. Application::GetResourceStream method: returns a resource stream for a resource data file located at the specified URI. Writing into a SqlFileStream: we use a buffer that we read into and write from. FileOptions: 0 => default: buffered reads, no write-through. Because there is no write-through, this might be a bit faster in some cases. The native API shipped first, and we wanted the client FILESTREAM code to be aggressive about flushing cached writes. The managed SqlFileStream class shipped some time after the native API. If the file access is ReadWrite, the SqlFileStream handle is positioned at the beginning of the file; use the System.IO Seek methods to move it.
  • Reading with bigger buffers gives better performance. FILESTREAM volumes: dedicated volumes means volumes not used for tempdb and not shared with the OS, the paging file, or SQL data and log files. If stored files are large, as we generally recommend, format with 64K clusters. Do compress FILESTREAM volumes or FILESTREAM containers, but ONLY if the data to be stored is compressible. Note that in this case the NTFS cluster size must be 4K. One volume per container enables space management at the volume level. Antivirus software should be configured to quarantine infected files rather than delete them; otherwise corruption will be reported. SMB: with 60 KB buffers, a read can happen in a single I/O and ideally comes back in a single TCP/IP packet. It is not 64 KB because 64 KB of data can't fit in a single TCP/IP buffer. Partitioning: FILESTREAM columns require the ROWGUID unique index for aligned partitioning or, if this is not possible, an explicit data placement option for the unique or primary key constraint on the ROWGUID column.
  • Optimized hot paths; removed unnecessary serialization, expensive file system operations, etc.
  • Not the first extraction; another instance. Each has specialty syntax. The user has to just know, and remember. Better to have one construct for all extraction-related BR services.
  • Expose this data to users. Customize: don’t want a fancy relationship, just sharing concepts!
  • In all examples: choose value, choose storage. Imagine IntelliSense: start typing, here’s the value!
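
The note above about needing a SQL transaction to obtain the transaction context can be sketched in T-SQL. This is a minimal sketch, assuming the Insurancedb..Claims table and ClaimImage FILESTREAM column that appear in the slides; the ClaimID key column and the value 4390 are borrowed from the RBS example purely for illustration and may not match the real schema.

```sql
-- Sketch only: ClaimID as the key column and the value 4390 are assumptions.
BEGIN TRANSACTION;

-- PathName() returns the logical path of the FILESTREAM BLOB.
-- GET_FILESTREAM_TRANSACTION_CONTEXT() returns a token for the current
-- transaction, or NULL when no transaction is open.
SELECT ClaimImage.PathName()                AS FilePath,
       GET_FILESTREAM_TRANSACTION_CONTEXT() AS TxnContext
FROM Insurancedb..Claims
WHERE ClaimID = 4390;

-- The client passes FilePath and TxnContext to SqlFileStream, streams the
-- data, and closes the stream before the transaction ends.

COMMIT TRANSACTION;
```

Both values are only meaningful inside the transaction that produced them, which is why the client-side streaming has to finish before the commit.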

    1. Make SQL Server the preferred choice for managing unstructured data and allow building rich application experiences on top
    2. Scale up storage and search to 100M to 500M documents. Easy use of and access to unstructured data from all applications. Rich insight into unstructured data to make better decisions.
    3. [Architecture diagram. Database applications (SQL apps) reach blobs through transactional streaming access; Windows applications reach files and folders through an SMB share with streaming Win32 access. Components shown: FILESTREAM, FileTable, Remote BLOB Storage (customer application calling the SQL RBS API, with provider libraries for FILESTREAM, Azure, Centera, and SQL DB), rich services (full-text search, semantic similarity), scale-up across multiple containers and disks, and integrated administration (Backup/Replication/AlwaysOn).]
    4. [Diagram: RBS write path across a machine boundary. (1) The application writes the BLOB (photo) through the RBS client library and provider library to the BLOB store. (2) The blob ID is returned. (3) The application writes the blob ID to the PhotoRef field in SQL Server. RBS client library services: Create, Fetch, GC, Delete. Example row: ClaimID 4390, ClaimDate 6/5/2007, PhotoRef <Binary(20)>.]
    5. // Store a new blob.
       byte[] myBlobId;
       SqlRemoteBlobContext blobContext = new SqlRemoteBlobContext(sqlConn);
       using (SqlRemoteBlob newBlob = blobContext.CreateNewBlob()) {
           // Write to a System.IO.Stream object.
           newBlob.Write(…);
           newBlob.Close();
           myBlobId = newBlob.BlobId;
       }
       // Alternative way to write.
       newBlob.WriteFromStream(inputStream);
    6. // Add a new row including the blob ID to the database table.
       // Fetch the blob.
       using (SqlRemoteBlob existingBlob = blobContext.OpenBlob(myBlobId)) {
           // Read from System.IO.Stream object.
           existingBlob.Read(...);
       }
       // Alternative way to read.
       existingBlob.ReadToStream(outputStream);
    7. Store BLOBs in DB + file system [Diagram: Application, BLOB, DB]
    8. -- New T-SQL function: GET_FILESTREAM_TRANSACTION_CONTEXT()
       SELECT GET_FILESTREAM_TRANSACTION_CONTEXT()
       -- New T-SQL function: PathName()
       SELECT ClaimImage.PathName()
       FROM Insurancedb..Claims
    9. // New SqlFileStream class in VS08 SP1
       SqlFileStream sfs = new SqlFileStream(path, txnId, System.IO.FileAccess.Read);
       // Output file to read into
       System.IO.FileStream fs = new System.IO.FileStream(@"c:\output2.jpg", System.IO.FileMode.Create);
       {
           byte[] buffer = new byte[512 * 1024];
           int cbBytesRead;
           // Read may return fewer bytes than requested before the end of
           // the stream, so loop until it returns 0.
           while ((cbBytesRead = sfs.Read(buffer, 0, buffer.Length)) > 0) {
               fs.Write(buffer, 0, cbBytesRead);
           }
       }
    10. [Code slide, mostly lost in extraction: write through a SqlFileStream (sfs.Write), then commit the SQL transaction and close the SQL connection.]
    11. FileTable Folder Hierarchy [Diagram levels: machine (my_machine); FILESTREAM share (MSSQLSERVER); database directories (Private Docs for Database1, Office Docs for Database2); FileTable directories (Media, Documents, LogFiles, each a FileTable); user-defined directory structure below each FileTable.]
    12. ALTER DATABASE Contoso SET FILESTREAM (NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'Contoso')
        CREATE TABLE Contoso..Documents AS FILETABLE WITH (FILETABLE_DIRECTORY = N'Document Library')
        Resulting path: \\<machine name>\<FILESTREAM share>\Contoso\Document Library
    13. FileTable Schema
        File Attribute Name    Type                         Purpose
        path_locator           hierarchyid                  Position of this node in the hierarchical FileNamespace
        parent_path_locator    hierarchyid                  hierarchyid of the parent directory (a computed column)
        stream_id              uniqueidentifier             Unique ID for the FILESTREAM data
        file_stream            varbinary(max) filestream    FILESTREAM data
        file_type              nvarchar(255)                Type of the file; can be used for full-text index creation
        cached_file_size       bigint                       Size of the FILESTREAM data (cached value)
        name                   nvarchar(255)                File/folder name (e.g. foo.txt)
        creation_time          datetime2                    Creation time
        last_write_time        datetime2                    Last write time
        last_access_time       datetime2                    Last access time
        is_directory           bit                          TRUE for directories
        is_offline             bit                          Offline attribute
        is_hidden              bit                          Hidden attribute
        is_readonly            bit                          Read-only attribute
        is_archive             bit                          Archive attribute
        is_system              bit                          System attribute
        is_temporary           bit                          Temporary attribute
    14. ALTER TABLE Documents DISABLE FILETABLE_NAMESPACE
    15. \\<machine>\<FILESTREAM share>\<Database_directory>\<FileTable_directory>\...
    16. GetFileNamespacePath(), FileTableRootPath(), GetPathLocator()
    17. DECLARE @path nvarchar(max)
        -- Get the FileNamespace path
        SELECT @path = file_stream.GetFileNamespacePath()
        FROM DocumentStore WHERE name = N'MySpec.doc';
        // Open a file handle (Win32)
        handle = CreateFile(@path, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    18. \\VNN\Share\db
    19. sys.dm_filestream_non_transacted_handles, sp_kill_filestream_non_transacted_handles
    20. CREATE/ALTER DATABASE: MAXSIZE. DBCC SHRINKFILE: EMPTYFILE.
    21. Use of multiple spindles for achieving better I/O scalability
    22. [Chart: only the series label "2012" survives.]
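    23. [Comparison table; the cell marks did not survive extraction. Column labels (order uncertain): SQL BLOBs, FILESTREAM, FILETABLE, Remote Blob API, File Stores / External Blob Stores (CAS). Rows: Streaming Performance, Win32 App Compat, Link Level Consistency, Data Level Consistency, Integrated Query & Management, Non-local Windows File Servers, External Blob Stores. Surviving cells: "Depends on external store" (twice each for Streaming Performance and Win32 App Compat) and "n/a" (once each for Non-local Windows File Servers and External Blob Stores).]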
    24. Features                                                   FileServer+DB   SQL 2008:     SQL 2012:
                                                                   Solution        FILESTREAM    FileTable
        Integrated admin operations for relational and file data   No              Yes           Yes
          (Backup/Restore, HA/Mirroring)
        Integrated services for relational and file data           No              Yes           Yes
          (Text/Semantic Search, Reports, Query, etc.)
        Integrated security model                                  No              Yes           Yes
        In-place update of FILESTREAM data (non-transacted)        Yes             No            Yes
        Fully transacted update of FILESTREAM data                 No              Yes           Yes
        File/directory hierarchy in DB                             No              No            Yes
        Win32 app compatibility                                    Yes             No            Yes
        Relational access to file attributes                       No              No            Yes
    25. Queries over a 350M-document database with random DML running in the background. Beats SQL Server 2005 with a scale factor of more than 2x and, on average, 60x better throughput.
    26. 2005/8 vs 2012 [Chart: query avgExecTime (ms) under various numbers of connections (50 to 2000 users) for the customer playback benchmark; series: 2005/8 and 2012.]
    27. New Search Filter for Document Properties
        CONTAINS ( PROPERTY ( { column_name }, 'property_name' ), 'contains_search_condition' )
    28. [Diagram: full-text and semantic processing of a source table. Source table Documents (Key, Title, Document): D1 Annual Budget, D2 Corporate Earnings, D3 Marketing Reports. Outputs: Keyphrases (ID, Keyword: T1 revenue, T2 growth, T3 Windows, T4 Azure); KeyphraseDocuments (ID, DocID), with rows such as T1 (revenue) / D1 (Annual Budget), T2 (growth) / D2 (Corporate Earnings), T3 (Windows) / D3 (Marketing Reports), T1 (revenue) / D7 (Finance Report), T3 (Windows) / D11 (Azure Strategy), T4 (Azure) / D11 (Azure Strategy); DocumentSimilarity (DocID, MatchedDocID), with rows D1 (Annual Budget) / D2 (Corporate Earnings), D1 (Annual Budget) / D7 (Finance Report), D3 (Marketing Reports) / D11 (Azure Strategy); and the full-text keyword index (ID, Keyword, Colid, …, compDocid, CompOc, CompPid), with rows K1 revenue and K2 growth.]
    29. CREATE FULLTEXT INDEX ON Production.Document (
            Title LANGUAGE 1033,
            Document TYPE COLUMN FileExtension LANGUAGE 1033 STATISTICAL_SEMANTICS
            …
        )
        KEY INDEX PK_Document_DocumentID
        ON documents_catalog
        WITH CHANGE_TRACKING OFF, NO POPULATION;

        ALTER FULLTEXT INDEX ON Production.Document
            ALTER COLUMN Document ADD STATISTICAL_SEMANTICS
            WITH NO POPULATION;
        …
        ALTER FULLTEXT INDEX ON Production.Document START FULL POPULATION;
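
The three path functions named on slide 16 might be combined as in the sketch below. It assumes a FileTable called dbo.Documents that contains a file named MySpec.doc; both names are placeholders (the slides themselves use Documents and DocumentStore), and the script has to run in the database that owns the FileTable.

```sql
-- Sketch only: dbo.Documents and MySpec.doc are assumed names.

-- Root-level UNC path of the FileTable.
SELECT FileTableRootPath(N'dbo.Documents') AS RootPath;

-- By default GetFileNamespacePath() returns a path relative to the
-- database-level directory; passing 1 asks for the full UNC path.
SELECT name,
       file_stream.GetFileNamespacePath()  AS RelativePath,
       file_stream.GetFileNamespacePath(1) AS FullPath
FROM dbo.Documents
WHERE name = N'MySpec.doc';

-- GetPathLocator() maps a full path back to the row's path_locator value.
DECLARE @fullPath nvarchar(4000);

SELECT @fullPath = file_stream.GetFileNamespacePath(1)
FROM dbo.Documents
WHERE name = N'MySpec.doc';

SELECT name, is_directory
FROM dbo.Documents
WHERE path_locator = GetPathLocator(@fullPath);
```

A Win32 call such as CreateFile needs the full UNC form, so the relative path is mainly useful when the application already knows the share and database directory.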
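
Slide 19 names a DMV and a stored procedure for managing non-transacted handles. A minimal sketch of how they might be used follows; dbo.Documents is an assumed FileTable name and 42 is a stand-in for a handle_id returned by the DMV.

```sql
-- Sketch only: dbo.Documents and handle 42 are placeholders.

-- List handles currently opened through non-transacted (Win32) access.
SELECT handle_id, opened_file_name, login_name, open_time
FROM sys.dm_filestream_non_transacted_handles;

-- Close every non-transacted handle that is open on one FileTable.
EXEC sp_kill_filestream_non_transacted_handles
     @table_name = N'dbo.Documents';

-- Close a single handle on that FileTable.
EXEC sp_kill_filestream_non_transacted_handles
     @table_name = N'dbo.Documents',
     @handle_id  = 42;
```

Closing a handle underneath a running application may cost that application unsaved work, so listing the handles first is the safer habit.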
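
The property search syntax on slide 27 only returns matches once the property has been registered in a search property list attached to the full-text index. The sketch below shows one way this could look for the Production.Document index used on slide 29; the list name DocumentProperties is an assumption, and the search terms are arbitrary examples.

```sql
-- Sketch only: DocumentProperties is an assumed name.
CREATE SEARCH PROPERTY LIST DocumentProperties;

-- Register the Title document property (System.Title).
ALTER SEARCH PROPERTY LIST DocumentProperties
    ADD N'Title'
    WITH (PROPERTY_SET_GUID    = 'F29F85E0-4FF9-1068-AB91-08002B27B3D9',
          PROPERTY_INT_ID      = 2,
          PROPERTY_DESCRIPTION = N'System.Title - title of the item');

-- Attach the list to the existing full-text index; this may start a
-- repopulation of the index.
ALTER FULLTEXT INDEX ON Production.Document
    SET SEARCH PROPERTY LIST DocumentProperties;

-- Search only the Title property of the indexed documents.
SELECT Title
FROM Production.Document
WHERE CONTAINS(PROPERTY(Document, 'Title'), 'Maintenance OR Repair');
```

Whether a property is actually extracted depends on the filter installed for the document type, so an empty result does not necessarily mean the query is wrong.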
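
Once the index from slide 29 has been populated with STATISTICAL_SEMANTICS, the keyphrase and similarity data pictured on slide 28 can be queried through the semantic rowset functions. This is a hedged sketch: it assumes the full-text key is an int column named DocumentID (inferred from the index name PK_Document_DocumentID) and that a document with key 1 exists.

```sql
-- Sketch only: DocumentID as an int key and the value 1 are assumptions.
DECLARE @docKey int = 1;

-- Top key phrases extracted from one document.
SELECT TOP (10) keyphrase, score
FROM SEMANTICKEYPHRASETABLE(Production.Document, Document, @docKey)
ORDER BY score DESC;

-- Documents most similar to the same document.
SELECT TOP (10) d.Title, s.score
FROM SEMANTICSIMILARITYTABLE(Production.Document, Document, @docKey) AS s
JOIN Production.Document AS d
    ON d.DocumentID = s.matched_document_key
ORDER BY s.score DESC;
```

The first query corresponds to the Keyphrases and KeyphraseDocuments tables in the diagram, the second to DocumentSimilarity.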
