What a developer should know about Oracle Text

Published in: Technology, Business

  1. What a developer should know about Oracle Text
  2. What is Oracle Text?
     - Oracle Text is a search technology built into Oracle Database 11g Standard and Enterprise Editions.
     - Oracle Text uses standard SQL to index, search, and analyze text stored in structured form inside an Oracle database, or in unstructured form on a local file system or on the Web.
     - Oracle Text search functionality includes: Boolean operators (AND, OR, NOT, NEAR, etc.), exact phrase match, section searching, fuzzy matching (words that are spelled similarly), stemming (search for mice and find mouse), wildcards, thesaurus (synonyms), stopwords, case sensitivity, search scoring, proximity (searching for words near one another), results ranking, and keyword highlighting.
  3. Functionalities of Oracle Text
     - Oracle Text can perform linguistic analysis on documents and search text using a variety of strategies, including keyword searching, Boolean operations, pattern matching, mixed queries (combining relational and unstructured data), and HTML/XML section searching.
     - Oracle Text can render search results in various formats, including unformatted text, HTML with highlighting, and the original document format.
     - Oracle Text supports multiple languages, including Japanese, Korean, and Traditional and Simplified Chinese.
  4. Why do we need Oracle Text?
     - A naive approach to implementing free-text search queries in a database could look something like this:
       SELECT * FROM issues WHERE LOWER(description) LIKE '% color %' AND LOWER(description) LIKE '% pink %';
     - With this technique, each keyword must be matched separately against every column in which it could appear, so that the keywords match in any order. Relational databases are also not designed to execute such queries efficiently: a LIKE pattern with a leading wildcard typically forces a full table scan, so this approach scales poorly.
     - Using Oracle Text, the query becomes:
       SELECT * FROM issues WHERE CONTAINS(description, 'color AND pink', 1) > 0;
  5. Oracle Text API
     - Oracle Text provides a complete SQL-based search API that consists of custom query operators, DDL syntax extensions, a set of PL/SQL procedures, and database views.
     - The API gives the application developer full control over indexing, queries, security, presentation, and the software configuration that is sometimes required.
     - The basic Oracle Text query takes a query expression, usually a word with or without operators, as input. Oracle Text returns all previously indexed documents that satisfy the expression, along with a relevance score for each document. Scores can be used to order the documents in the result set.
  6. Query Operators
     - To enter an Oracle Text query, use the SQL SELECT statement. Depending on the type of index you create, you use either the CONTAINS or the CATSEARCH operator in the WHERE clause.
     - The MATCHES operator is used to classify documents with a CTXRULE index.
     - You can use these operators programmatically wherever you can use the SELECT statement, such as in PL/SQL cursors.
  7. Oracle Text Index types

     | Index Type | Application Type | Query Operator |
     |---|---|---|
     | CONTEXT | Use this index to build a text retrieval application when your text consists of large coherent documents. You can index documents of different formats, such as MS Word, HTML, XML or plain text. With a CONTEXT index you can customize your index in a variety of ways. | CONTAINS |
     | CTXCAT | Use this index type to index small text fragments such as item names, prices and descriptions that are stored across columns. Particularly suited to mixed queries. | CATSEARCH |
     | CTXRULE | Use a CTXRULE index to build a document classification application. The CTXRULE index is an index created on a table of queries, where each query has a classification. Single documents (plain text, HTML or XML) can be classified using the MATCHES operator. | MATCHES |
     | CTXXPATH | Can only be created on an XMLType column, to speed up ExistsNode() queries on that column. | Use with ExistsNode() |
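
As a rough sketch of the two less familiar index types listed above, the following assumes a hypothetical `auction` table (for CTXCAT) and a hypothetical `queries` table (for CTXRULE). Neither table appears in the slides, and the user is assumed to hold the CTXAPP role.

```sql
-- Hypothetical tables for illustration only; not part of the original slides.
CREATE TABLE auction (
  item_id NUMBER PRIMARY KEY,
  title   VARCHAR2(100),
  price   NUMBER
);

-- CTXCAT: an index set names the structured columns used in mixed queries.
BEGIN
  ctx_ddl.create_index_set('auction_iset');
  ctx_ddl.add_index('auction_iset', 'price');
END;
/

CREATE INDEX auction_titlex ON auction(title)
  INDEXTYPE IS CTXSYS.CTXCAT
  PARAMETERS ('index set auction_iset');

-- Text part ('camera') and structured part ('price < 200') in one operator.
SELECT item_id, title, price
FROM   auction
WHERE  CATSEARCH(title, 'camera', 'price < 200') > 0;

-- CTXRULE: the index is built on a table of queries, one per category.
CREATE TABLE queries (
  query_id NUMBER PRIMARY KEY,
  category VARCHAR2(30),
  query    VARCHAR2(2000)
);

INSERT INTO queries VALUES (1, 'Sports', 'tennis OR football');
INSERT INTO queries VALUES (2, 'Health', 'medicine AND hospital');
COMMIT;

CREATE INDEX queries_rulex ON queries(query)
  INDEXTYPE IS CTXSYS.CTXRULE;

-- Classify one document: returns the categories whose stored query matches it.
SELECT category
FROM   queries
WHERE  MATCHES(query, 'Local team wins the tennis final') > 0;
```

Note the direction of the CTXRULE lookup: the stored rows are queries and the search argument is the document, which is the reverse of CONTAINS.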
  8. Setting Up Oracle Text
     - Oracle Text is installed by default with Oracle Database XE. With other database editions, you need to install the Oracle Text feature yourself. Once the feature is present, you only need to create a normal database user and grant the roles and privileges needed to execute the index management procedures in the Oracle Text packages.
     - Step 1: Create User
       CREATE USER myuser IDENTIFIED BY myuser_password;
     - Step 2: Grant Roles
       GRANT RESOURCE, CONNECT, CTXAPP TO myuser;
     - Step 3: Grant EXECUTE Privileges on CTX PL/SQL Packages
       GRANT EXECUTE ON CTX_CLS TO myuser;
       GRANT EXECUTE ON CTX_DDL TO myuser;
       GRANT EXECUTE ON CTX_DOC TO myuser;
       GRANT EXECUTE ON CTX_OUTPUT TO myuser;
       GRANT EXECUTE ON CTX_QUERY TO myuser;
       GRANT EXECUTE ON CTX_REPORT TO myuser;
       GRANT EXECUTE ON CTX_THES TO myuser;
  9. Creating CONTEXT Index
     - Oracle Text must index retrievable data items before users can find content with search.
     - Oracle Text has different index types that are suitable for different purposes. For full-text search with large documents, the CONTEXT index is the appropriate index type.
     - By default, you index the values of a single column. If you want to combine data from several tables, you need to create a custom PL/SQL procedure (a user-defined datastore) that acts as a storage abstraction.
     - The Oracle Text indexing process is modeled after a pipeline, where data items retrieved from a data store pass through a series of transformations before their keywords are added to the index.
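
A minimal sketch of creating a CONTEXT index with all default preferences, assuming the `issues` table and `description` column used in the earlier query. The index name `issue_index` matches the one used later in the maintenance slide.

```sql
-- Assumes a table "issues" with a text column "description",
-- and that no Text index exists yet on that column.
CREATE INDEX issue_index ON issues(description)
  INDEXTYPE IS CTXSYS.CONTEXT;

-- Rows that could not be indexed are reported in this view.
SELECT err_timestamp, err_textkey, err_text
FROM   ctx_user_index_errors
WHERE  err_index_name = 'ISSUE_INDEX';
```

Checking `ctx_user_index_errors` after index creation is worthwhile because CREATE INDEX can succeed even when individual documents fail to index.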
  10. Oracle Text Indexing Process
  11. The Oracle Text Indexing Process
     - The indexing process is split into multiple phases, which are configurable by the application developer. The indexing process includes the following phases:
     - Data Retrieval: Data is fetched from a data store (for example, a Web page, a database large object, or the local file system) and passed as a stream of data to the next phase.
     - Filtering: The filters are responsible for converting data in different file formats to plain text. The other components in the indexing pipeline only process plain text data and don't know about file formats such as Microsoft Word or Excel.
     - Sectioning: The sectioner adds metadata about the structure of the original data item.
     - Lexing: A stream of characters is split into words based on the language of the item.
     - Indexing: In this final phase, the keywords are added to the actual index.
  12. Indexing Architecture
     - 1. Datastore - The datastore defines where the text to be indexed is fetched from: text stored within the database, on a file system, or accessed remotely via the HTTP protocol. Custom datastores may be defined that fetch the data from a location, protocol or application of the customer's choice.
     - Default Datastore - The default datastore is the database itself. Text may be stored in a VARCHAR2 column (up to 4000 characters), or in a CLOB (Character Large Object) column. Formatted text (such as Word or PDF documents) can be stored in BLOB (Binary Large Object) columns.
     - File Datastore - Text to be indexed is stored on any file system that is accessible to the database server. The name or path of the file is stored in the database, typically in a VARCHAR2 column.
     - URL Datastore - The database contains an HTTP URL, and the text to be indexed is fetched directly from the URL at indexing time.
     - User Defined Datastore - A PL/SQL procedure is specified, which is called for each row in the table being indexed. The PL/SQL procedure may, in turn, call programs in other languages such as Java or C/C++ via the EXTPROC external procedures mechanism.
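
The user-defined datastore described above might look roughly like the following. This is a sketch under assumptions: the `issues` table is assumed to have VARCHAR2 columns `title` and `description`, and the names `issue_text_proc` and `issue_ds` are hypothetical. Ownership and privilege requirements for the procedure differ between Oracle releases, so check the documentation for your version.

```sql
-- Called once per indexed row: builds the text to index from several columns.
CREATE OR REPLACE PROCEDURE issue_text_proc (
  rid  IN ROWID,
  tlob IN OUT NOCOPY CLOB
) IS
  v_text VARCHAR2(32767);
BEGIN
  FOR r IN (SELECT title, description FROM issues WHERE rowid = rid) LOOP
    -- The separating space keeps v_text non-null even if both columns are null.
    v_text := r.title || ' ' || r.description;
    DBMS_LOB.WRITEAPPEND(tlob, LENGTH(v_text), v_text);
  END LOOP;
END;
/

-- Register the procedure as a datastore preference.
BEGIN
  ctx_ddl.create_preference('issue_ds', 'USER_DATASTORE');
  ctx_ddl.set_attribute('issue_ds', 'PROCEDURE', 'issue_text_proc');
  ctx_ddl.set_attribute('issue_ds', 'OUTPUT_TYPE', 'CLOB');
END;
/

-- Assumes no Text index exists yet on this column.
CREATE INDEX issue_index ON issues(description)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE issue_ds');
```

One design consequence: the index is still attached to `description`, so a change to `title` alone does not mark the row for reindexing unless `description` is also updated.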
  13. Indexing Architecture (continued)
     - 2. Filter - The filter stage is responsible for processing "formatted" documents such as Microsoft Office files or PDF documents. The built-in AUTO_FILTER recognizes all common document formats and can translate them into indexable HTML text. Application developers may replace the filter stage with their own custom-built filter, or with a filter purchased from a third party.
     - 3. Sectioner - The sectioner object is responsible for identifying the containing section(s) for each text unit. Typically, these sections are predefined HyperText Markup Language (HTML) or Extensible Markup Language (XML) sections. Optionally, the sectioner can process all tags as section delimiters. For example: the TITLE tag in <TITLE>XML Handbook</TITLE>.
     - The sectioner object separates the stream into text and section information. Section information includes where sections begin and end in the text stream. The section information is passed directly to the indexing engine, which uses it later. The text is passed to the lexer.
  14. Indexing Architecture (continued)
     - 4. Lexer - The lexer's job is to separate the sectioner's output into words or tokens. To extract tokens, the lexer uses the parameters defined in the lexer preference. These parameters include the definitions of the characters that separate tokens, such as whitespace, and whether to convert the text to all uppercase or to leave it in mixed case.
     - Lexer Types - Oracle Text supports the indexing of different languages by enabling you to choose a lexer in the indexing process. The lexer you employ determines the languages you can index. For example, BASIC_LEXER supports English and most western European languages that use whitespace-delimited words.
     - Lexer Preferences - There are many options available for fine-tuning the lexer. For example, the developer can choose whether an index should be case sensitive or case insensitive, and whether particular characters should split tokens or be indexed as part of them. For example, should "PL/SQL" be indexed as the two terms "PL" and "SQL" or as the single string "PL/SQL"?
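
The lexer choices described above can be sketched as a preference like the one below. The preference name `issue_lexer` is hypothetical, and the attribute values are only one possible configuration.

```sql
BEGIN
  ctx_ddl.create_preference('issue_lexer', 'BASIC_LEXER');
  -- Keep "/" and "-" inside tokens, so PL/SQL and web-site stay single tokens.
  ctx_ddl.set_attribute('issue_lexer', 'PRINTJOINS', '/-');
  -- Case-insensitive index: tokens are converted to uppercase.
  ctx_ddl.set_attribute('issue_lexer', 'MIXED_CASE', 'NO');
END;
/

-- The preference takes effect when named at index creation, for example:
--   CREATE INDEX ... INDEXTYPE IS CTXSYS.CONTEXT
--     PARAMETERS ('LEXER issue_lexer');
```

Lexer settings are applied at indexing time, so changing them on an existing index generally means rebuilding that index.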
  15. Indexing Architecture (continued)
     - 5. Indexing Engine - The indexing engine creates the index that maps tokens to the documents that contain them.
     - In this phase, Oracle uses the stoplist you specify to exclude stopwords or stopthemes from the index. Oracle also uses the parameters defined in the WORDLIST preference, which tell the system how to create a prefix index or substring index, if enabled.
     - The final output of the pipeline is an inverted index. This is a list of the words from the documents, with each word having a list of the documents in which it appears.
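
A hedged sketch of a WORDLIST preference that enables the prefix and substring indexes mentioned above. The name `issue_wordlist` and the length limits are assumptions chosen for illustration.

```sql
BEGIN
  ctx_ddl.create_preference('issue_wordlist', 'BASIC_WORDLIST');
  -- Prefix index: speeds up right-truncated wildcard queries such as 'colo%'.
  ctx_ddl.set_attribute('issue_wordlist', 'PREFIX_INDEX', 'TRUE');
  ctx_ddl.set_attribute('issue_wordlist', 'PREFIX_MIN_LENGTH', '3');
  ctx_ddl.set_attribute('issue_wordlist', 'PREFIX_MAX_LENGTH', '6');
  -- Substring index: speeds up left-truncated and double-truncated wildcards.
  ctx_ddl.set_attribute('issue_wordlist', 'SUBSTRING_INDEX', 'TRUE');
END;
/

-- Named at index creation, for example:
--   PARAMETERS ('WORDLIST issue_wordlist')
```

Both options trade index size and indexing time for faster wildcard queries, so they are usually enabled only when wildcard searches are common.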
  16. Creating Preferences
     - A preference is an optional parameter that affects the way Oracle Text creates an index. There are preferences for datastore, filtering, lexers, storage, wordlist, section types, and more. A preference may or may not have attributes associated with it. Preferences are created with the CTX_DDL.CREATE_PREFERENCE procedure.
     - To create a stoplist, use CTX_DDL.CREATE_STOPLIST. You can add stopwords to a stoplist with CTX_DDL.ADD_STOPWORD.
     - To create section groups, use CTX_DDL.CREATE_SECTION_GROUP and specify a section group type.
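
The procedures named above could be combined roughly as follows. The names `issue_stoplist` and `issue_sections` are hypothetical, and the example assumes the `description` column holds HTML with a TITLE tag.

```sql
BEGIN
  -- A custom stoplist with two stopwords.
  ctx_ddl.create_stoplist('issue_stoplist', 'BASIC_STOPLIST');
  ctx_ddl.add_stopword('issue_stoplist', 'this');
  ctx_ddl.add_stopword('issue_stoplist', 'that');

  -- A section group that exposes the HTML TITLE tag as the section "title".
  ctx_ddl.create_section_group('issue_sections', 'HTML_SECTION_GROUP');
  ctx_ddl.add_zone_section('issue_sections', 'title', 'TITLE');
END;
/

-- Assumes no Text index exists yet on this column.
CREATE INDEX issue_index ON issues(description)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('STOPLIST issue_stoplist SECTION GROUP issue_sections');

-- Section search: match "handbook" only inside the title section.
SELECT *
FROM   issues
WHERE  CONTAINS(description, 'handbook WITHIN title', 1) > 0;
```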
  17. Searching
     - The CONTAINS operator is used for searching CONTEXT indexes. Wildcard characters can be used in CONTAINS queries for prefix and suffix matching.
     - Syntax for the CONTAINS operator:
       CONTAINS(
         [schema.]column,
         text_query VARCHAR2
         [, label NUMBER]
       ) RETURN NUMBER;
     - CONTAINS returns a relevance score for every row selected. You obtain this score with the SCORE operator. The SCORE operator can be used in a SELECT, ORDER BY, or GROUP BY clause.
     - For languages that are supported by Oracle Text, fuzzy matching and stemming are enabled by default. To use these features, apply the fuzzy() or $ query operator, respectively, inside the CONTAINS query expression.
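
A few illustrative CONTAINS queries, assuming a CONTEXT index exists on `issues(description)`. The `issue_id` column is a hypothetical primary key that does not appear in the slides.

```sql
-- Rank matches by relevance; the label 1 ties SCORE(1) to this CONTAINS call.
SELECT SCORE(1) AS relevance, issue_id, description
FROM   issues
WHERE  CONTAINS(description, 'color AND pink', 1) > 0
ORDER  BY SCORE(1) DESC;

-- Fuzzy matching: also finds similarly spelled words such as "color".
SELECT issue_id
FROM   issues
WHERE  CONTAINS(description, 'fuzzy(colour)', 1) > 0;

-- Stemming: expands to words with the same root (sing, sang, sung).
SELECT issue_id
FROM   issues
WHERE  CONTAINS(description, '$sing', 1) > 0;

-- Wildcard: prefix match on "colo".
SELECT issue_id
FROM   issues
WHERE  CONTAINS(description, 'colo%', 1) > 0;
```

The label matters only when a statement contains more than one CONTAINS call, where each SCORE(n) needs to be matched to its own operator.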
  18. Index Maintenance
     - Because the index holds a copy of base table data, it needs to be periodically synchronized with the base table. Index maintenance procedures can be found in the CTX_DDL PL/SQL package.
     - In 11g, users can specify the index update preference at index creation: manually, on commit, or at regular intervals.
     - The following example shows how the index can be updated manually to reflect base table changes:
       EXECUTE ctx_ddl.sync_index('issue_index', '2M');
     - It is also possible to have the database execute this task automatically at regular intervals. You can also choose to use the operating system or other scheduling facilities to initiate synchronization.
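
The synchronization options above can be sketched as follows, assuming the `issues` table and the `issue_index` name used earlier. The one-hour interval is an arbitrary illustration, and interval-based sync may require additional job-scheduling privileges.

```sql
-- Option 1: synchronize automatically whenever a transaction commits.
-- Assumes no Text index exists yet on this column.
CREATE INDEX issue_index ON issues(description)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('SYNC (ON COMMIT)');

-- Option 2 (alternative PARAMETERS clause): synchronize every hour.
--   PARAMETERS ('SYNC (EVERY "SYSDATE+1/24")')

-- Rows changed in the base table but not yet synchronized to the index.
SELECT pnd_index_name, pnd_rowid, pnd_timestamp
FROM   ctx_user_pending;

-- Frequent synchronization fragments the index; optimize it periodically.
EXECUTE ctx_ddl.optimize_index('issue_index', 'FULL');
```

Sync on commit keeps search results current at the cost of slower commits and faster fragmentation, so interval or manual sync is often preferred for write-heavy tables.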
  19. Glossary
     - Stopwords - Stopwords are words for which Oracle Text does not create an index entry. They are usually common words in your language that are unlikely to be searched on by themselves. Oracle Text includes a default list of stopwords for your language. This list is called a stoplist. For example, in English, the words this and that are defined as stopwords in the default stoplist. You can modify the default stoplist or create new stoplists with the CTX_DDL package.
     - Wordlist - An Oracle Text preference that controls features such as fuzzy matching and stemming, as well as substring and prefix indexing, which improve performance for wildcard queries with CONTAINS and CATSEARCH.
     - Mixed Queries - Queries that have a text search part and a structured part; that is, queries that combine a CONTAINS clause with additional constraints using the AND operator.
     - Fuzzy Operator - This operator enables you to search for words that have a similar spelling to the specified word.
     - Stem Operator - This operator enables you to search for words that have the same root as the specified term. For example, a stem of $sing expands into a query on the words sang, sung, sing.
     - Printjoin - Non-alphanumeric characters that are to be included in index tokens, so that words such as web-site are indexed as web-site.
     - Startjoin - One or more non-alphanumeric characters that, when encountered as the first character in a token, explicitly identify the start of the token.
     - Endjoin - One or more non-alphanumeric characters that, when encountered as the last character in a token, explicitly identify the end of the token.
  20. Resources for further reading
     - Oracle Text - http://www.oracle.com/technology/pub/articles/asplund-textsearch.html
     - Downloads - http://www.oracle.com/technology/products/text/index.html
     - Oracle Text FAQ - http://www.oracle.com/technology/products/text/x/FAQs/imt_Faq.html
     - Preferences - http://download.oracle.com/docs/cd/B10500_01/text.920/a96518/cdatadic.htm#34966
     - Oracle Text SQL operators - http://www.itk.ilstu.edu/docs/oracle/text.101/b10730/csql.htm#i997503
     - CONTAINS query operators - http://stanford.edu/dept/itss/docs/oracle/10g/text.101/b10730/cqoper.htm#i998027
     - Optimize Text Retrieval - http://www.oracle.com/technology/oramag/oracle/04-sep/o54text.html
     - Oracle Text Performance - http://www.oracle.com/technology/products/text/x/faqs/imt_perf_faq.html#q04
  21. That’s It!