iFTS – SQL 2008 FTS Engine
Shay's talk from ISUG meeting number 87.

Presentation Transcript

  • iFTS – SQL 2008 FTS Engine • Hebrew Full-Text Search in the Real World
  • Hebrew in the real world
  • Agenda • Tapuz • iFTS – Introduction • iFTS – Terms and keywords • Setting up Full-Text • Index structure • Population • Querying • Improvements from 2005 • Tapuz solution • Known Issues
  • Tapuz – It’s all about content • 5 major websites: Forums, Communa, Blogs, Flix (Video), Albums • Over 165 million content items • Over 3 million registered users • Thousands of new items every day • More than 30 web servers • SQL Server: • SQL Server 2005 Enterprise Edition on a 2-node cluster • 4 quad-core CPUs, 16 GB RAM • ~500 GB of data in 5 major databases • ~1,200 batch requests per second
  • Tapuz – old search engines • 3 different search engines: ° 3 different database systems ° Search often didn’t return correct results ° 3 different relevance-sorting algorithms ° Very resource intensive (more than 20 servers used for search alone!) ° No support for advanced search (dynamic fields) ° Long delay before a new item is indexed
  • Tapuz Search – project requirements • Search through most of the existing content (more than 165M items) • Allow querying newly added items in real time • The search engine's default language is Hebrew, and its special linguistic characteristics should be supported • Dynamic field search: the user can choose which fields to search • Should have a relevance-sorting mechanism
  • Challenges • The search should add minimal load on the production SQL Server • Should have decent query performance • Real-time item indexing • How do we handle Hebrew ??!!*#$??!%?
  • The solution (diagram) • SQL 2005 Enterprise cluster • Transactional replication • SQL 2008 Standard • Auto change-tracking population
  • iFTS – Introduction • FTS allows fast and flexible indexing for keyword-based querying of text data • SQL Server has had full-text search capabilities since version 7.0 • The Full-Text Engine supports two roles: indexing and querying • Full-text indexes can be created not only on textual data columns, but also on binary columns • Common uses: searching Web sites, product catalogs, document management systems
  • Terms and Keywords Full-Text Catalog Document Population Full-Text Index
  • Terms and Keywords • Population (also known as a crawl): the process of creating and maintaining a full-text index (creating and building the index)
  • Terms and Keywords Population Filter Word breaker Stemmer
  • Terms and Keywords • Population: Filter, Word breaker, Stemmer • Filter: given a specified file extension such as .doc, filters extract text from a file stored in a varbinary(max) or image column
  • Terms and Keywords • Population: Filter, Word breaker, Stemmer • Word breaker: for a given language, a word breaker tokenizes the text, identifying individual words by determining where word boundaries exist based on the lexical rules of the language
  • Terms and Keywords • Population: Filter, Word breaker, Stemmer • Stemmer: for a given language, a stemmer generates inflectional forms of a particular word based on the rules of that language
  • Terms and Keywords Population Token Filter Word breaker Stemmer
  • Terms and Keywords • Population: Token, Filter, Word breaker, Stemmer • Token: a word or a character string identified by a word breaker
  • Terms and Keywords • Population: Token, Filter, Word breaker, Stemmer • STOPLIST: a set of STOPWORD entries
  • Terms and Keywords • Population: Token, Filter, Word breaker, Stemmer • Stopword: a word that is not relevant to your search and is filtered out of the indexing and query processes • SQL Server 2008 introduces stoplists • Stoplist: a list of stopwords
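The word breaker and stoplist steps described above can be pictured with a small Python sketch. This is only a toy model, not how SQL Server implements them: the regular-expression "word breaker" and the four-word `STOPLIST` are assumptions made for the example. It shows that stopwords are dropped while the surviving tokens keep their original word positions.

```python
import re

# Hypothetical stoplist for illustration; real stoplists are managed inside SQL Server.
STOPLIST = {"there", "is", "like", "in"}

def break_words(text):
    """Toy word breaker: split on non-word characters and number tokens from 1."""
    words = re.findall(r"\w+", text.lower())
    return list(enumerate(words, start=1))

def remove_stopwords(tokens, stoplist=STOPLIST):
    """Drop stopwords but keep each surviving token's original position."""
    return [(pos, word) for pos, word in tokens if word not in stoplist]

tokens = break_words("there is nothing like searching in English")
print(remove_stopwords(tokens))
# prints [(3, 'nothing'), (5, 'searching'), (7, 'english')]
```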
  • Terms and Keywords • Full-text index: stores information about significant words and their location within a given column
  • Terms and Keywords • Full-text catalog: a logical concept that refers to a group of full-text indexes
  • Setting up Full-Text
  • Setting up Full-Text • Creating a full-text index • A full-text index is a special type of token-based index • In order to create a full-text index on a table or a view, it must have a unique, single-column, non-nullable index • Can be created on columns of type: char, varchar, nchar, nvarchar, text, ntext, image, xml, varbinary, and varbinary(max) • Each index supports only a single language per column
  • Index Structure • Source row: ID = 1, Text_English = “there is nothing like searching in English” • Full-text index (Keyword, ColId, DocId, Occurrence): (English, 3, 1, 7), (Nothing, 3, 1, 3), (Searching, 3, 1, 5) • Demo
  • Index Structure • Keyword: the Keyword column contains a representation of a single token extracted at indexing time. Word breakers determine what makes up a token.
  • Index Structure • ColId: the ColId column contains a value that corresponds to a particular column that is full-text indexed.
  • Index Structure • DocId: the DocId column contains eight-byte integer values that map to a particular full-text key value in a full-text indexed table.
  • Index Structure • Occurrence: the Occurrence column contains an integer value. For each DocId value, there is a list of occurrence values that correspond to the relative word offsets of the particular keyword within that DocId.
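The four columns described above can be reproduced with a toy inverted index in Python. This is a simplified sketch under stated assumptions, not the engine's real storage format: the stoplist and the regular-expression tokenizer are made up for the example, while the row values (ColId 3, DocId 1) are taken from the slide.

```python
import re
from collections import defaultdict

STOPLIST = {"there", "is", "like", "in"}  # hypothetical stoplist for the example

def index_rows(doc_id, col_id, text, stoplist=STOPLIST):
    """Return (keyword, col_id, doc_id, occurrence) rows for one column value."""
    rows = []
    for occurrence, word in enumerate(re.findall(r"\w+", text.lower()), start=1):
        if word not in stoplist:
            rows.append((word, col_id, doc_id, occurrence))
    return sorted(rows)

def build_lookup(rows):
    """Group rows by keyword so a search term maps to its (doc_id, occurrence) postings."""
    lookup = defaultdict(list)
    for keyword, _col_id, doc_id, occurrence in rows:
        lookup[keyword].append((doc_id, occurrence))
    return dict(lookup)

rows = index_rows(doc_id=1, col_id=3, text="there is nothing like searching in English")
for row in rows:
    print(row)
# ('english', 3, 1, 7)
# ('nothing', 3, 1, 3)
# ('searching', 3, 1, 5)
```

The occurrence values match the slide because positions are counted before stopwords are removed, which is what makes phrase and proximity matching possible.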
  • Population Process • Population methods: 1. Full: a full population builds index entries for all the rows of the base table or indexed view 2. Change tracking: SQL Server tracks changes to the base table since the last population, in either Auto or Manual mode 3. Incremental timestamp-based population
  • Querying • CONTAINS, FREETEXT: predicates used in the WHERE clause. Syntax: CONTAINS(column_name, search_string) • CONTAINSTABLE, FREETEXTTABLE: table-valued functions (TVFs) that also return a rank. Syntax: SELECT * FROM CONTAINSTABLE(table_name, column_name, search_string, top_n)
  • iFTS enhancements in SQL Server 2008 • Fully integrated into SQL Server • Stoplists • New tools for troubleshooting SQL Server 2008 Full-Text Search (DMVs) • A new word breaker family (Hebrew and other languages) • Performance improvements (reasons: integer key, full integration)
  • Hebrew???? • Demos
  • New DMVs and management tools • sys.dm_fts_parser • sys.dm_fts_index_keywords • sys.dm_fts_index_keywords_by_doc • sys.fulltext_index_fragments • FULLTEXTCATALOGPROPERTY: – MergeStatus – PopulateStatus • OBJECTPROPERTYEX: – TableFulltextPopulateStatus – TableFulltextPendingChanges
  • Tips and Tricks • Why scan if you can… FORCESEEK: a new hint that can help a bit in determining the query plan • When using CONTAINS, don’t forget to use double quotes (") when searching for more than one word • Use to escape special characters • To search quotes (“) in the text use "
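The quoting tip above is easy to get wrong when the search string is assembled in application code. Here is a minimal Python sketch of a helper that wraps every word or phrase in double quotes before it is passed to CONTAINS. The function names are hypothetical, and the sketch makes one simplifying assumption: double quotes inside a term are dropped rather than escaped.

```python
def quote_term(term):
    """Wrap one word or phrase in double quotes for a CONTAINS search condition.

    Assumption: embedded double quotes are simply dropped, not escaped.
    """
    cleaned = " ".join(term.replace('"', " ").split())
    return f'"{cleaned}"'

def contains_condition(terms, operator="AND"):
    """Join quoted terms with AND or OR; empty terms are skipped."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    quoted = [quote_term(t) for t in terms if t.replace('"', " ").strip()]
    return f" {operator} ".join(quoted)

print(contains_condition(["full text", "search"]))
# prints "full text" AND "search"
```

The resulting string would then be supplied as the search_string argument, ideally as a query parameter rather than concatenated into the SQL text.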
  • Tips and Tricks • Use an integer key as the unique index • Place the full-text index on a separate filegroup • Performance degrades when the full-text index is fragmented; use reorganize to merge the fragments
  • Tapuz Solution • SQL 2008 64-bit Standard Edition, 16 GB RAM, 2 quad-core CPUs • Transactional replication • FT indexes on a different filegroup than the main tables • Change tracking (AUTO) • Daily reorganization of fragmented indexes only • A hierarchical set of queries so that the most relevant results are returned first • Dynamic SQL so that dynamic search fields can be used
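The dynamic search fields mentioned above could be handled along the following lines. This Python sketch is not Tapuz's implementation: the table name, key column and searchable column names are hypothetical. It shows one safe pattern for dynamic SQL, where user-selected columns are checked against a whitelist and the search condition itself stays a parameter (the `?` placeholder).

```python
# Hypothetical table and column names, for illustration only.
SEARCHABLE_COLUMNS = {"Title", "Body", "Author"}

def build_search_query(columns, table="dbo.ForumMessages", key="MessageID"):
    """Build a parameterized CONTAINS query over the user-selected columns."""
    chosen = [c for c in columns if c in SEARCHABLE_COLUMNS]
    if not chosen:
        raise ValueError("no searchable column selected")
    column_list = ", ".join(f"[{c}]" for c in chosen)
    # The search condition itself is passed as a parameter (the ? placeholder).
    return f"SELECT [{key}] FROM {table} WHERE CONTAINS(({column_list}), ?)"

print(build_search_query(["Title", "Body"]))
# prints SELECT [MessageID] FROM dbo.ForumMessages WHERE CONTAINS(([Title], [Body]), ?)
```

Only column names from the whitelist ever reach the SQL text, so the user's choice of fields cannot inject arbitrary SQL.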
  • Results relevance sorting logic • Freetext ranking (Okapi BM25) • Contains • Contains all words (using AND) • Free search (freetext)
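One way to read the sorting logic above is as a series of tiers that run from strictest to loosest, with stricter matches always listed first. The Python sketch below assumes that reading; the slide does not spell out the exact order or merge rule. Each tier is a stand-in callable rather than a real query, and the document ids are invented.

```python
def tiered_search(tiers, limit):
    """Run search tiers in priority order and merge their results.

    `tiers` is a list of zero-argument callables, each returning document ids
    already ordered by rank within that tier.
    """
    results, seen = [], set()
    if limit <= 0:
        return results
    for run_tier in tiers:
        for doc_id in run_tier():
            if doc_id not in seen:
                seen.add(doc_id)
                results.append(doc_id)
                if len(results) == limit:
                    return results
    return results

# Stand-ins for real queries; the ids are made up for the example.
exact_phrase = lambda: [7, 3]       # e.g. CONTAINS with the whole phrase
all_words = lambda: [3, 9, 4]       # e.g. CONTAINS with the words joined by AND
free_text = lambda: [4, 1, 7, 8]    # e.g. FREETEXT

print(tiered_search([exact_phrase, all_words, free_text], limit=5))
# prints [7, 3, 9, 4, 1]
```

Because later tiers run only until the limit is reached, the looser and more expensive searches can be skipped entirely when the strict tiers already fill the page.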
  • Numbers • Index sizes – 53 GB (~68 GB Data) • Number of rows indexed – >165M • AVG search time – 1.7 Sec • More than 97% of the searches respond in less than 7 Sec • Number of searches (2 months) – more than 6 million • Number of connections – ~900
  • Known issues found so far • High CPU load and intense disk I/O during queries • Population and merges are resource intensive • Ranking without a TVF? Not possible • Statistics, query plans and join types are not always optimal, and hints can’t be used • No scale-out or partitioning options
  • References • Books Online • SQL Server 2008 Full-Text Search: Internals and Enhancements: http://technet.microsoft.com/en-us/library/cc721269.aspx#_Toc202506227 • Pro Full-Text Search in SQL Server 2008 by Michael Coles