iFTS – SQL 2008 FTS Engine

Shay's lecture from ISUG meeting no. 87.

Published in: Technology

  1. iFTS – SQL 2008 FTS Engine: Hebrew Full-Text Search in the Real World
  2. Hebrew in the real world
  3. Agenda
       • Tapuz
       • iFTS – Introduction
       • iFTS – Terms and keywords
       • Setting up Full-Text
       • Index structure
       • Population
       • Querying
       • Improvements from 2005
       • Tapuz solution
       • Known issues
  4. Tapuz – It's all about content
       • 5 major websites: Forums, Communa, Blogs, Flix (video), Albums
       • Over 165 million content items
       • Over 3 million registered users
       • Thousands of new items every day
       • More than 30 web servers
       • SQL Server:
           ° SQL Server 2005 Enterprise Edition on a 2-node cluster
           ° 4 quad-core CPUs, 16 GB RAM
           ° ~500 GB of data in 5 major databases
           ° ~1,200 batch requests per second
  5. Tapuz – old search engines
       • 3 different search engines:
           ° 3 different database systems
           ° Search often didn't return correct results
           ° 3 different relevance sort algorithms
           ° Very resource intensive (more than 20 servers used for search alone!)
           ° No support for advanced search (dynamic fields)
           ° Long delay before a new item is indexed
  6. Tapuz Search – project requirements
       • Search through most of the existing content (more than 165M items)
       • Allow querying newly added items in real time
       • The search engine's default language is Hebrew, and its special linguistic characteristics should be supported
       • Dynamic fields search: the user can choose which fields to search
       • Should have a relevance sorting mechanism
  7. Challenges
       • The search should add minimal load on the production SQL Server
       • Should have decent query performance
       • Real-time item indexing
       • How do we handle Hebrew ??!!*#$??!%?
  8. The solution (diagram): transactional replication from the SQL 2005 Enterprise cluster to a SQL 2008 Standard server, whose full-text indexes use auto change-tracking population
  9. iFTS – Introduction
       • FTS allows fast and flexible indexing for keyword-based querying of text data
       • SQL Server has had full-text search capabilities since version 7.0
       • The Full-Text Engine supports two roles: indexing and querying
       • Full-text indexes can be created not only on textual data columns, but also on binary columns
       • Common uses: searching web sites, product catalogs, document management systems
  10. Terms and Keywords (diagram): Full-Text Catalog, Document, Population, Full-Text Index
  11. Terms and Keywords: Population
       • Population (also known as a crawl) is the process of creating and maintaining a full-text index
  12. Terms and Keywords (diagram): Population uses a Filter, a Word breaker, and a Stemmer
  13. Terms and Keywords: Filter
       • Given a specified file extension such as .doc, filters extract text from a file stored in a varbinary(max) or image column
  14. Terms and Keywords: Word breaker
       • For a given language, a word breaker tokenizes the text, identifying individual words by determining where word boundaries exist based on the lexical rules of the language
  15. Terms and Keywords: Stemmer
       • For a given language, a stemmer generates inflectional forms of a particular word based on the rules of that language
  16. Terms and Keywords (diagram): Population with Filter, Word breaker, Stemmer, and Token
  17. Terms and Keywords: Token
       • A token is a word or a character string identified by a word breaker
  18. Terms and Keywords (diagram): Population with Filter, Word breaker, Stemmer, Token, and a Stoplist made up of stopwords
  19. Terms and Keywords: Stopword and Stoplist
       • A stopword is a word that is not relevant to your search and is filtered out of the indexing and query processes
       • SQL Server 2008 introduces stoplists; a stoplist is a list of stopwords
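The word breaker, token, and stoplist terms above can be sketched in a few lines of Python. This is only an illustration of the concepts, not how the SQL Server engine is implemented: the regular expression stands in for a real language-specific word breaker, and the `STOPLIST` contents are an assumption chosen to match the example sentence used later in the deck.

```python
import re

# Hypothetical stoplist for illustration only; real stoplists are
# maintained per language inside SQL Server.
STOPLIST = {"there", "is", "like", "in"}

def word_break(text):
    """Toy word breaker: yield (position, token) pairs, positions 1-based."""
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        yield position, match.group().lower()

def index_tokens(text, stoplist=STOPLIST):
    """Keep only the tokens that are not stopwords.

    Stopwords are dropped, but the surviving tokens keep their original
    positions, so word offsets still reflect the full sentence.
    """
    return [(pos, tok) for pos, tok in word_break(text) if tok not in stoplist]

tokens = index_tokens("there is nothing like searching in English")
print(tokens)
```

Note that the stopwords still consume positions, which is why the surviving tokens are numbered 3, 5, and 7 rather than 1, 2, and 3.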
  20. Terms and Keywords (diagram): Full-Text Catalog, Document, Population, Full-Text Index
  21. Terms and Keywords: Full-Text Index
       • A full-text index stores information about significant words and their location within a given column
  22. Terms and Keywords: Full-Text Catalog
       • A full-text catalog is a logical concept that refers to a group of full-text indexes
  23. Setting up Full-Text
  24. Setting up Full-Text: Creating a full-text index
       • A full-text index is a special type of token-based index
       • In order to create a full-text index on a table or a view, it must have a unique, single-column, non-nullable index
       • Can be created on columns of type: char, varchar, nchar, nvarchar, text, ntext, image, xml, varbinary, and varbinary(max)
       • Each index supports only a single language per column
  25. Setting up Full-Text: Creating a full-text index
  26. Index Structure
       Source row:
         ID  Text_English
         1   there is nothing like searching in English
       Full-text index:
         Keyword    ColId  DocId  Occurrence
         English    3      1      7
         Nothing    3      1      3
         Searching  3      1      5
       • Demo
  27. Index Structure: Keyword
       • The Keyword column contains a representation of a single token extracted at indexing time. Word breakers determine what makes up a token.
  28. Index Structure (table repeated from slide 26)
  29. Index Structure: ColId
       • The ColId column contains a value that corresponds to a particular column that is full-text indexed
  30. Index Structure (table repeated from slide 26)
  31. Index Structure: DocId
       • The DocId column contains eight-byte integer values that map to a particular full-text key value in a full-text indexed table
  32. Index Structure (table repeated from slide 26)
  33. Index Structure: Occurrence
       • The Occurrence column contains an integer value. For each DocId value, there is a list of occurrence values that correspond to the relative word offsets of the particular keyword within that DocId.
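The index table from the slides can be reproduced with a small Python sketch. It is a conceptual model only: the `IndexRow` type, the whitespace split, the lowercasing, and the `STOPWORDS` set are all assumptions made for illustration, and the real engine stores this data in its own compressed internal format.

```python
from collections import namedtuple

# One row of the conceptual full-text index shown on the slides.
IndexRow = namedtuple("IndexRow", "keyword col_id doc_id occurrence")

# Assumed stopwords, chosen so the output matches the slide's example.
STOPWORDS = frozenset({"there", "is", "like", "in"})

def build_rows(doc_id, col_id, text, stopwords=STOPWORDS):
    """Return index rows for one column value of one document.

    Occurrence is the 1-based word offset within the text; stopwords are
    skipped but still advance the offset.
    """
    rows = []
    for offset, word in enumerate(text.lower().split(), start=1):
        if word not in stopwords:
            rows.append(IndexRow(word, col_id, doc_id, offset))
    return sorted(rows)  # sorted by keyword, like the table on the slide

rows = build_rows(doc_id=1, col_id=3,
                  text="there is nothing like searching in English")
for row in rows:
    print(row)
```

Because each keyword carries its word offset, a phrase or proximity query can be answered by comparing Occurrence values for the same DocId instead of rereading the source text.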
  34. Population Process: population methods
       1. Full: a full population builds index entries for all the rows of the base table or indexed view
       2. Change tracking: SQL Server tracks changes to the base table since the last population
           ° Auto
           ° Manual
       3. Incremental timestamp-based population
  35. Querying
       • CONTAINS, FREETEXT: used as predicates (in the WHERE clause). Syntax: CONTAINS (column_name, search_string)
       • CONTAINSTABLE, FREETEXTTABLE: table-valued functions that include ranking. Syntax: SELECT * FROM CONTAINSTABLE (table_name, column_name, search_string, top_n)
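As a rough illustration of the CONTAINSTABLE form above, the following Python sketch builds a ranked query string that joins the function's KEY and RANK columns back to the base table. The function name, the identifier check, and the example names `dbo.Posts`, `PostId`, and `Body` are hypothetical; the `?` marker assumes a DB-API driver with question-mark parameters (pyodbc, for example), and nothing here connects to a server.

```python
import re

# Conservative pattern for names we are willing to splice into SQL text:
# either "name" or "schema.name".
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")

def ranked_search_sql(table, key_column, text_column, top_n=100):
    """Build a ranked full-text query; the search string stays a parameter."""
    for name in (table, key_column, text_column):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    top_n = int(top_n)  # embedded as a literal, so force it to be an integer
    return (
        f"SELECT src.*, ft.[RANK] "
        f"FROM CONTAINSTABLE({table}, {text_column}, ?, {top_n}) AS ft "
        f"JOIN {table} AS src ON src.{key_column} = ft.[KEY] "
        f"ORDER BY ft.[RANK] DESC"
    )

# Hypothetical table and column names, for illustration only.
sql = ranked_search_sql("dbo.Posts", "PostId", "Body", top_n=50)
print(sql)
```

Table and column names cannot be passed as parameters, which is why they are validated and spliced into the text while the user's search string is not.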
  36. iFTS enhancements in SQL Server 2008
       • Fully integrated into SQL Server
       • Stoplists
       • New tools for troubleshooting SQL Server 2008 Full-Text Search (DMVs)
       • A new word breaker family (Hebrew and other languages)
       • Performance improvements (reasons: integer key, full integration)
  37. Hebrew???? (demos)
  38. New DMVs and management tools
       • sys.dm_fts_parser
       • sys.dm_fts_index_keywords
       • sys.dm_fts_index_keywords_by_doc
       • sys.fulltext_index_fragments
       • FULLTEXTCATALOGPROPERTY:
           – MergeStatus
           – PopulateStatus
       • OBJECTPROPERTYEX:
           – TableFulltextPopulateStatus
           – TableFulltextPendingChanges
  39. Tips and Tricks
       • Why scan if you can... FORCESEEK: a new hint that can help a bit in determining the query plan
       • When using CONTAINS, don't forget to use quotes (") if searching for more than one word
       • Use to escape special characters
       • To search quotes (") in the text use "
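The quoting tip can be illustrated with a small Python helper that turns raw user input into a CONTAINS search condition. This is a minimal sketch under stated assumptions: the function names are hypothetical, and as a simplification it drops double quotes from the input rather than escaping them.

```python
def _words(user_input):
    # Simplification (an assumption of this sketch): remove double quotes
    # from user input instead of trying to escape them.
    return user_input.replace('"', " ").split()

def contains_all_words(user_input, operator="AND"):
    """Quote each word separately and join them, e.g. "full" AND "text"."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    words = _words(user_input)
    if not words:
        raise ValueError("empty search string")
    return f" {operator} ".join(f'"{word}"' for word in words)

def contains_phrase(user_input):
    """Quote the whole input so it is searched as one phrase."""
    words = _words(user_input)
    if not words:
        raise ValueError("empty search string")
    return '"' + " ".join(words) + '"'

print(contains_all_words("full text search"))
print(contains_phrase("full text search"))
```

The resulting string is meant to be passed as the search-condition parameter of CONTAINS or CONTAINSTABLE, not concatenated into the SQL text.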
  40. Tips and Tricks
       • Use an integer key as the unique index
       • Place the full-text index on another filegroup
       • Performance degrades when the full-text index is fragmented; use REORGANIZE to merge the fragments
  41. Tapuz Solution
       • SQL 2008 64-bit Standard Edition, 16 GB RAM, 2 quad-core CPUs
       • Transactional replication
       • Full-text indexes on a different filegroup than the main tables
       • Change tracking (AUTO)
       • Daily reorganizing of fragmented indexes only
       • A hierarchy of queries to make sure the most relevant results are returned first
       • Dynamic SQL, so that dynamic search fields can be used
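One way to picture the dynamic-fields point is a whitelist that maps user-facing field names to indexed columns before the predicate text is built. The sketch below is not the Tapuz implementation: the field names, column names, and function name are hypothetical, and the `?` marker again assumes a question-mark DB-API driver.

```python
# Hypothetical mapping from user-facing field names to full-text indexed
# columns; only these columns may ever appear in the generated SQL.
SEARCHABLE_COLUMNS = {
    "title": "Title",
    "body": "Body",
    "author": "AuthorName",
}

def contains_predicate(requested_fields):
    """Build a CONTAINS predicate over the columns the user selected."""
    if not requested_fields:
        raise ValueError("no search fields selected")
    columns = []
    for field in requested_fields:
        if field not in SEARCHABLE_COLUMNS:
            raise ValueError(f"unknown search field: {field!r}")
        columns.append(f"[{SEARCHABLE_COLUMNS[field]}]")
    # The search string itself is left as a parameter marker.
    return f"CONTAINS(({', '.join(columns)}), ?)"

predicate = contains_predicate(["title", "body"])
print(predicate)
```

Rejecting unknown fields keeps the dynamic part of the statement limited to a fixed set of column names, while user text only ever travels as a parameter.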
  42. Results relevance sorting logic
       • Freetext ranking (Okapi BM25)
       • Contains
       • Contains all words (using AND)
       • Free search (freetext)
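One possible reading of this tiered logic is: run the strictest query first, then fall back to looser ones, and append only results not already seen. The Python sketch below shows that merge step in isolation. The `tiered_search` function and the stand-in tiers are hypothetical; in practice each tier would run a CONTAINS or FREETEXT query and return document keys ordered by rank.

```python
def tiered_search(tiers, limit):
    """Merge results from query tiers, strictest first, without duplicates.

    tiers is a list of zero-argument callables, each returning document
    keys ordered by rank. Later tiers only fill the remaining slots.
    """
    seen = set()
    results = []
    for run_tier in tiers:
        if len(results) >= limit:
            break  # earlier tiers already filled the page
        for key in run_tier():
            if key not in seen:
                seen.add(key)
                results.append(key)
                if len(results) == limit:
                    break
    return results

# Stand-in tiers with made-up keys: phrase match, all words, free search.
merged = tiered_search(
    [lambda: [7, 3], lambda: [3, 9, 7], lambda: [12, 9, 1]],
    limit=5,
)
print(merged)
```

Stopping as soon as the page is full means the looser, usually more expensive tiers are skipped whenever the strict ones return enough rows.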
  43. Numbers
       • Index size: 53 GB (~68 GB of data)
       • Number of rows indexed: more than 165M
       • Average search time: 1.7 sec
       • More than 97% of the searches respond in less than 7 sec
       • Number of searches (2 months): more than 6 million
       • Number of connections: ~900
  44. Known issues found so far
       • High CPU load and intense disk I/O during queries
       • Population and merges are resource intensive
       • Ranking without a TVF?? Impossible
       • Statistics, query plans, and join types are not always optimal, and hints can't be used
       • No scale-out or partitioning options
  45. References
       • Books Online
       • SQL Server 2008 Full-Text Search: Internals and Enhancements: http://technet.microsoft.com/en-us/library/cc721269.aspx#_Toc202506227
       • Pro Full-Text Search in SQL Server 2008 by Michael Coles
