iFTS – SQL 2008 FTS Engine

Hebrew Full-Text Search in the real world
Shay's talk from ISUG meeting no. 87.

Published in: Technology
Transcript of "iFTS – SQL 2008 FTS Engine"

  1. iFTS – SQL 2008 FTS Engine: Hebrew Full-Text Search in the real world
  2. Hebrew in the real world
  3. Agenda
     • Tapuz
     • iFTS – Introduction
     • iFTS – Terms and keywords
     • Setting up Full-Text
     • Index structure
     • Population
     • Querying
     • Improvements from 2005
     • Tapuz solution
     • Known issues
  4. Tapuz – It’s all about content
     • 5 major websites: Forums, Communa, Blogs, Flix (Video), Albums
     • Over 165 million content items
     • Over 3 million registered users
     • Thousands of new items every day
     • More than 30 web servers
     • SQL Server:
       ° SQL Server 2005 Enterprise Edition on a 2-node cluster
       ° 4 quad-core CPUs, 16 GB RAM
       ° ~500 GB of data in 5 major databases
       ° ~1200 batch requests per second
  5. Tapuz – old search engines
     • 3 different search engines:
       ° 3 different database systems
       ° Search often didn’t return correct results
       ° 3 different relevance sort algorithms
       ° Very resource intensive (more than 20 servers used for search alone!)
       ° No support for advanced search (dynamic fields)
       ° Long delay before a new item is indexed
  6. Tapuz Search – project requirements
     • Search through most of the existing content (more than 165M items)
     • Allow querying newly added items in real time
     • The search engine's default language is Hebrew, and its special linguistic characteristics should be supported
     • Dynamic fields search: the user can choose which fields to search
     • Should have a relevance sorting mechanism
  7. Challenges
     • The search should add minimal load on the production SQL Server
     • Should have decent query performance
     • Real-time item indexing
     • How do we handle Hebrew ??!!*#$??!%?
  8. The solution
     [Diagram labels: SQL 2005 Enterprise Cluster, Transactional replication, SQL 2008 Standard, Auto change tracking population]
  9. iFTS – Introduction
     • FTS allows fast and flexible indexing for keyword-based querying of text data
     • SQL Server has had full-text search capabilities since version 7.0
     • The Full-Text Engine supports two roles: indexing and querying
     • Full-text indexes can be created not only on textual data columns, but also on binary columns
     • Common uses: searching web sites, product catalogs, document management systems
  10. Terms and Keywords
      [Diagram labels: Full-Text Catalog, Document, Population, Full-Text Index]
  11. Terms and Keywords
      Population (also known as a crawl) is the process of creating and maintaining a full-text index.
  12. Terms and Keywords – Population
      [Diagram labels: Filter, Word breaker, Stemmer]
  13. Terms and Keywords – Population
      Filter: given a specified file extension such as .doc, filters extract text from a file stored in a varbinary(max) or IMAGE column.
  14. Terms and Keywords – Population
      Word breaker: for a given language, a word breaker tokenizes the text, identifying individual words by determining where word boundaries exist based on the lexical rules of the language.
  15. Terms and Keywords – Population
      Stemmer: for a given language, a stemmer generates inflectional forms of a particular word based on the rules of that language.
  16. Terms and Keywords – Population
      [Diagram labels: Token, Filter, Word breaker, Stemmer]
  17. Terms and Keywords – Population
      Token: a token is a word or a character string identified by a word breaker.
  18. Terms and Keywords – Population
      [Diagram labels: Token, Filter, Word breaker, Stemmer, STOPLIST holding several STOPWORD entries]
  19. Terms and Keywords – Population
      Stopword: a stopword is a word that is not relevant to your search and is filtered out of the indexing and query processes.
      Stoplist: SQL Server 2008 introduces stoplists. A stoplist is a list of stopwords.
  20. Terms and Keywords
      [Diagram labels: Full-Text Catalog, Document, Population, Full-Text Index]
  21. Terms and Keywords
      Full-text index: a full-text index stores information about significant words and their location within a given column.
  22. Terms and Keywords
      Full-text catalog: a full-text catalog is a logical concept that refers to a group of full-text indexes.
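The catalog and stoplist terms above map onto a few DDL statements. The following is a minimal sketch, assuming SQL Server 2008 with database compatibility level 100; the names DemoCatalog and DemoStoplist and the stopword itself are hypothetical, and 1037 is the LCID for Hebrew.

```sql
-- Hypothetical object names; adjust to your own database.
CREATE FULLTEXT CATALOG DemoCatalog;
GO

-- Start from a copy of the system stoplist, then add a custom stopword
-- for Hebrew (LCID 1037). The stopword is only an illustration.
CREATE FULLTEXT STOPLIST DemoStoplist FROM SYSTEM STOPLIST;
GO
ALTER FULLTEXT STOPLIST DemoStoplist ADD N'forum' LANGUAGE 1037;
GO

-- List the Hebrew stopwords now defined in the custom stoplist.
SELECT sw.stopword, sw.language_id
FROM sys.fulltext_stopwords AS sw
JOIN sys.fulltext_stoplists AS sl
    ON sl.stoplist_id = sw.stoplist_id
WHERE sl.name = N'DemoStoplist'
  AND sw.language_id = 1037;
```

A stoplist only takes effect once a full-text index is created or altered to use it.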
  23. Setting up Full-Text
  24. Setting up Full-Text – Creating a Full-Text index
      • A full-text index is a special type of token-based index
      • In order to create a full-text index on a table or a view, it must have a unique, single-column, non-nullable index
      • Can be created on columns of type: char, varchar, nchar, nvarchar, text, ntext, image, xml, varbinary, and varbinary(max)
      • Each index supports only a single language per column
  25. Setting up Full-Text – Creating a Full-Text index
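The requirements on slide 24 can be sketched in T-SQL as follows. The table dbo.ForumMessages, its columns, and the catalog DemoCatalog are hypothetical names chosen for illustration; they are not taken from the Tapuz schema.

```sql
-- Hypothetical table: the integer primary key satisfies the
-- "unique, single-column, non-nullable index" requirement.
CREATE TABLE dbo.ForumMessages
(
    MessageId INT           NOT NULL,
    Subject   NVARCHAR(200) NOT NULL,
    Body      NVARCHAR(MAX) NULL,
    CONSTRAINT PK_ForumMessages PRIMARY KEY CLUSTERED (MessageId)
);
GO

-- Create the catalog only if it does not exist yet.
IF NOT EXISTS (SELECT 1 FROM sys.fulltext_catalogs WHERE name = N'DemoCatalog')
    CREATE FULLTEXT CATALOG DemoCatalog;
GO

-- One language per column: both columns use Hebrew (LCID 1037).
CREATE FULLTEXT INDEX ON dbo.ForumMessages
(
    Subject LANGUAGE 1037,
    Body    LANGUAGE 1037
)
KEY INDEX PK_ForumMessages
ON DemoCatalog
WITH (CHANGE_TRACKING = AUTO, STOPLIST = SYSTEM);
GO
```

A table can have only one full-text index, so every searchable column is listed in the same statement.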
  26. Index Structure
      Source row:
          ID   Text_English
          1    there is nothing like searching in English
      Full-text index:
          Keyword     ColId   DocId   Occurrence
          English     3       1       7
          Nothing     3       1       3
          Searching   3       1       5
      • Demo
  27. Index Structure (same table as slide 26)
      The Keyword column contains a representation of a single token extracted at indexing time. Word breakers determine what makes up a token.
  28. Index Structure (same table as slide 26)
  29. Index Structure (same table as slide 26)
      The ColId column contains a value that corresponds to a particular column that is full-text indexed.
  30. Index Structure (same table as slide 26)
  31. Index Structure (same table as slide 26)
      The DocId column contains eight-byte integer values that map to a particular full-text key value in a full-text indexed table.
  32. Index Structure (same table as slide 26)
  33. Index Structure (same table as slide 26)
      The Occurrence column contains an integer value. For each DocId value, there is a list of occurrence values that correspond to the relative word offsets of the particular keyword within that DocId.
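The index contents described on slides 26 to 33 can be inspected with the sys.dm_fts_index_keywords_by_doc function. The sketch below recreates the slide's single-row example under hypothetical names (dbo.FtsDemo, DemoCatalog) and assumes English (LCID 1033) with the system stoplist. Note that this function reports an occurrence count per keyword and document, not the individual word offsets shown on the slides.

```sql
-- Hypothetical demo table mirroring the slide's source row.
CREATE TABLE dbo.FtsDemo
(
    ID           INT           NOT NULL,
    Text_English NVARCHAR(200) NOT NULL,
    CONSTRAINT PK_FtsDemo PRIMARY KEY CLUSTERED (ID)
);
INSERT INTO dbo.FtsDemo (ID, Text_English)
VALUES (1, N'there is nothing like searching in English');
GO

IF NOT EXISTS (SELECT 1 FROM sys.fulltext_catalogs WHERE name = N'DemoCatalog')
    CREATE FULLTEXT CATALOG DemoCatalog;
GO

CREATE FULLTEXT INDEX ON dbo.FtsDemo (Text_English LANGUAGE 1033)
KEY INDEX PK_FtsDemo
ON DemoCatalog
WITH (CHANGE_TRACKING = AUTO, STOPLIST = SYSTEM);
GO

-- Population runs asynchronously, so wait for it to finish before
-- running this query; an empty result usually means it is still running.
SELECT k.display_term, k.column_id, k.document_id, k.occurrence_count
FROM sys.dm_fts_index_keywords_by_doc(DB_ID(), OBJECT_ID(N'dbo.FtsDemo')) AS k
ORDER BY k.document_id, k.display_term;
```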
  34. Population Process – Population methods
      1. Full: a full population builds index entries for all the rows of the base table or indexed view
      2. Change tracking: SQL Server tracks changes to the base table since the last population
         1. Auto
         2. Manual
      3. Incremental timestamp-based population
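Each population method corresponds to an ALTER FULLTEXT INDEX option. The statements below are a sketch against a hypothetical, already full-text-indexed table dbo.ForumMessages; they are alternatives to choose from, not a script to run from top to bottom.

```sql
-- 1. Full population: rebuild index entries for every row.
ALTER FULLTEXT INDEX ON dbo.ForumMessages START FULL POPULATION;

-- 2a. Change tracking, AUTO: tracked changes are propagated to the
--     index in the background without further commands.
ALTER FULLTEXT INDEX ON dbo.ForumMessages SET CHANGE_TRACKING AUTO;

-- 2b. Change tracking, MANUAL: changes are tracked but applied only
--     when an update population is started explicitly.
ALTER FULLTEXT INDEX ON dbo.ForumMessages SET CHANGE_TRACKING MANUAL;
ALTER FULLTEXT INDEX ON dbo.ForumMessages START UPDATE POPULATION;

-- 3. Incremental population: relies on a timestamp column in the table;
--    without one, SQL Server performs a full population instead.
ALTER FULLTEXT INDEX ON dbo.ForumMessages SET CHANGE_TRACKING OFF;
ALTER FULLTEXT INDEX ON dbo.ForumMessages START INCREMENTAL POPULATION;
```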
  35. Querying
      • CONTAINS, FREETEXT: used as predicates (in the WHERE clause)
        Syntax:
        CONTAINS (column_name, search_string)
      • CONTAINSTABLE, FREETEXTTABLE: table-valued functions (TVFs) that include ranking
        Syntax:
        SELECT * FROM CONTAINSTABLE (table_name, column_name, search_string, top_n)
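The two query styles can be compared side by side. This sketch assumes a hypothetical table dbo.ForumMessages whose key is MessageId and whose Body column is full-text indexed; the search terms are arbitrary.

```sql
-- Predicate form: filters rows, returns no rank.
SELECT m.MessageId, m.Subject
FROM dbo.ForumMessages AS m
WHERE CONTAINS(m.Body, N'"full text"');

-- FREETEXT matches on meaning rather than exact wording
-- (word breaking, stemming, thesaurus expansion).
SELECT m.MessageId, m.Subject
FROM dbo.ForumMessages AS m
WHERE FREETEXT(m.Body, N'searching forums');

-- Table-valued form: returns [KEY] and [RANK], joined back to the
-- base table on the full-text key column. The last argument limits
-- the result to the 20 highest-ranked matches.
SELECT m.MessageId, m.Subject, ft.[RANK]
FROM CONTAINSTABLE(dbo.ForumMessages, Body, N'search AND engine', 20) AS ft
JOIN dbo.ForumMessages AS m
    ON m.MessageId = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```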
  36. iFTS enhancements in SQL Server 2008
      • Fully integrated into SQL Server
      • Stoplists
      • New tools for troubleshooting SQL Server 2008 Full-Text Search (DMVs)
      • A new word breaker family (Hebrew and other languages)
      • Performance improvements (reasons: integer key, full integration)
  37. Hebrew???? DEMOS
  38. New DMVs and management tools
      • sys.dm_fts_parser
      • sys.dm_fts_index_keywords
      • sys.dm_fts_index_keywords_by_doc
      • sys.fulltext_index_fragments
      • FULLTEXTCATALOGPROPERTY:
        – MergeStatus
        – PopulateStatus
      • OBJECTPROPERTYEX:
        – TableFulltextPopulateStatus
        – TableFulltextPendingChanges
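A few of the tools listed above can be tried directly. The sketch below assumes hypothetical objects (catalog DemoCatalog, table dbo.ForumMessages) and a Hebrew sample phrase chosen for illustration; sys.dm_fts_parser requires sysadmin membership.

```sql
-- Show how the Hebrew word breaker (LCID 1037) tokenizes a phrase.
-- Third argument 0 = system stoplist; fourth 0 = accent insensitive.
SELECT display_term, occurrence, special_term
FROM sys.dm_fts_parser(N'"חיפוש בעברית"', 1037, 0, 0);

-- Catalog-level status: population and master-merge progress.
SELECT FULLTEXTCATALOGPROPERTY(N'DemoCatalog', 'PopulateStatus') AS PopulateStatus,
       FULLTEXTCATALOGPROPERTY(N'DemoCatalog', 'MergeStatus')    AS MergeStatus;

-- Table-level status: population state and rows waiting to be indexed.
SELECT OBJECTPROPERTYEX(OBJECT_ID(N'dbo.ForumMessages'),
                        'TableFulltextPopulateStatus') AS TablePopulateStatus,
       OBJECTPROPERTYEX(OBJECT_ID(N'dbo.ForumMessages'),
                        'TableFulltextPendingChanges') AS PendingChanges;
```

Running the parser against real user queries is a quick way to check how Hebrew prefixes and word boundaries are handled before blaming the index.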
  39. Tips and Tricks
      • Why scan if you can… FORCESEEK, a new hint, can help a bit in determining the query plan
      • When using CONTAINS, don’t forget to use quotes (") if searching for more than one word
      • Use to escape special characters
      • To search quotes (") in the text use "
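The quoting and FORCESEEK tips can be sketched as follows, again against the hypothetical table dbo.ForumMessages. Dropping embedded double quotes from the user input is a simplifying assumption made here, not the escaping rule from the slide.

```sql
DECLARE @userInput NVARCHAR(200) = N'full text search';

-- Wrap the multi-word input in double quotes so CONTAINS treats it as
-- one phrase. Assumption: embedded quotes are simply removed.
DECLARE @phrase NVARCHAR(210) =
    N'"' + REPLACE(@userInput, N'"', N'') + N'"';

SELECT m.MessageId, m.Subject
FROM dbo.ForumMessages AS m
WHERE CONTAINS(m.Body, @phrase);

-- FORCESEEK asks the optimizer to seek on the base table when joining
-- the ranked keys back to it, instead of scanning the whole table.
SELECT m.MessageId, m.Subject, ft.[RANK]
FROM CONTAINSTABLE(dbo.ForumMessages, Body, @phrase, 20) AS ft
JOIN dbo.ForumMessages AS m WITH (FORCESEEK)
    ON m.MessageId = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```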
  40. Tips and Tricks
      • Use an integer key as the unique index
      • Place the full-text index on a separate filegroup
      • Performance degrades when the full-text index is fragmented; use REORGANIZE to merge the fragments
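Fragmentation can be checked before deciding to reorganize. This is a sketch that assumes a hypothetical catalog named DemoCatalog; what fragment count justifies a reorganize is a judgment call that the slides do not specify.

```sql
-- Count queryable fragments per full-text indexed table.
-- Status 4 = closed and ready for query; 6 = merge input, still queryable.
SELECT OBJECT_NAME(f.table_id) AS table_name,
       COUNT(*)                AS fragment_count,
       SUM(f.data_size)        AS total_bytes
FROM sys.fulltext_index_fragments AS f
WHERE f.status IN (4, 6)
GROUP BY f.table_id
ORDER BY fragment_count DESC;

-- Merge the fragments of every index in the catalog (a master merge).
ALTER FULLTEXT CATALOG DemoCatalog REORGANIZE;
```

The filegroup tip is applied when the index is created, by writing the ON clause of CREATE FULLTEXT INDEX as (catalog_name, FILEGROUP filegroup_name).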
  41. Tapuz Solution
      • SQL Server 2008 64-bit Standard Edition, 16 GB RAM, 2 quad-core CPUs
      • Transactional replication
      • Full-text indexes on a different filegroup than the main tables
      • Change tracking (AUTO)
      • Daily reorganization of fragmented indexes only
      • A hierarchy of queries so that the most relevant results are returned first
      • Dynamic SQL, so that dynamic search fields can be used
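The dynamic search fields idea can be sketched with sp_executesql. This is not the Tapuz implementation: the table dbo.ForumMessages, its columns, and the flag variables are hypothetical. Column names come only from fixed literals in the script, and the search string is passed as a parameter, which keeps user input out of the SQL text.

```sql
DECLARE @searchSubject BIT = 1;            -- user ticked "Subject"
DECLARE @searchBody    BIT = 1;            -- user ticked "Body"
DECLARE @search NVARCHAR(4000) = N'"hebrew search"';

-- Build the column list from fixed literals only.
DECLARE @cols NVARCHAR(400) = N'';
IF @searchSubject = 1 SET @cols = @cols + N'Subject,';
IF @searchBody    = 1 SET @cols = @cols + N'Body,';
IF @cols = N''        SET @cols = N'Body,';   -- fallback if nothing chosen
SET @cols = LEFT(@cols, LEN(@cols) - 1);       -- drop trailing comma

DECLARE @sql NVARCHAR(MAX) = N'
SELECT m.MessageId, m.Subject, ft.[RANK]
FROM CONTAINSTABLE(dbo.ForumMessages, (' + @cols + N'), @search, 50) AS ft
JOIN dbo.ForumMessages AS m
    ON m.MessageId = ft.[KEY]
ORDER BY ft.[RANK] DESC;';

EXEC sys.sp_executesql
     @sql,
     N'@search NVARCHAR(4000)',
     @search = @search;
```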
  42. Results relevance sorting logic
      • Freetext ranking (Okapi BM25)
      • Contains
      • Contains all words (using AND)
      • Free search (freetext)
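One possible reading of this hierarchy is a tiered query: exact phrase matches first, then rows containing all words, then free-text matches. The sketch below is an interpretation under that assumption, not the actual Tapuz query; the table dbo.ForumMessages and the search terms are hypothetical.

```sql
DECLARE @phrase   NVARCHAR(400) = N'"hebrew search"';          -- tier 1
DECLARE @allWords NVARCHAR(400) = N'"hebrew" AND "search"';    -- tier 2
DECLARE @freeText NVARCHAR(400) = N'hebrew search';            -- tier 3

WITH tiers AS
(
    SELECT [KEY] AS DocKey, [RANK] AS DocRank, 1 AS Tier
    FROM CONTAINSTABLE(dbo.ForumMessages, Body, @phrase)
    UNION ALL
    SELECT [KEY], [RANK], 2
    FROM CONTAINSTABLE(dbo.ForumMessages, Body, @allWords)
    UNION ALL
    SELECT [KEY], [RANK], 3
    FROM FREETEXTTABLE(dbo.ForumMessages, Body, @freeText)
),
best AS
(
    -- Keep each document once, in the best (lowest) tier it reached.
    SELECT DocKey, DocRank, Tier,
           ROW_NUMBER() OVER (PARTITION BY DocKey
                              ORDER BY Tier, DocRank DESC) AS rn
    FROM tiers
)
SELECT TOP (50) m.MessageId, m.Subject, b.Tier, b.DocRank
FROM best AS b
JOIN dbo.ForumMessages AS m
    ON m.MessageId = b.DocKey
WHERE b.rn = 1
ORDER BY b.Tier, b.DocRank DESC;
```

Rank values are only comparable within a single query, which is why the tier number, not the raw rank, is the primary sort key.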
  43. Numbers
      • Index size: 53 GB (~68 GB of data)
      • Number of rows indexed: >165M
      • Average search time: 1.7 sec
      • More than 97% of searches respond in less than 7 sec
      • Number of searches (2 months): more than 6 million
      • Number of connections: ~900
  44. Known issues found so far
      • High CPU load and intense disk I/O during queries
      • Population and merges are resource intensive
      • Ranking not as a TVF?? Impossible
      • Statistics, query plans and join types are not always optimal, and hints can’t be used
      • No scale-out or partitioning options
  45. References
      • Books Online
      • SQL Server 2008 Full-Text Search: Internals and Enhancements: http://technet.microsoft.com/en-us/library/cc721269.aspx#_Toc202506227
      • Pro Full-Text Search in SQL Server 2008 by Michael Coles