
SAP HANA SPS10- Text Analysis & Text Mining

See what's new in SAP HANA SPS10- Text Analysis & Text Mining

Published in: Technology


  1. SAP HANA SPS 10 - What’s New? Text Analysis & Text Mining (Delta from SPS 09 to SPS 10). SAP HANA Product Management, June 2015. © 2014 SAP AG or an SAP affiliate company. All rights reserved.
  2. Text Analysis
  3. Agenda – Text Analysis: New or Improved Features
     • Text Analysis XS API
     • Grammatical role analysis – Controlled Beta!
     • Metadata properties extraction
     • Processing performance
     • Document filters
     • Enterprise fact extraction
     • Chinese module
     • Polish linguistic analysis
     • Part-of-speech mappings in East Asian languages
     • Core extraction: PERSON & PRODUCT
     © 2015 SAP SE or an SAP affiliate company. All rights reserved. Public
  4. New: Text Analysis XS API
     For on-demand processing, text analysis can now be officially used via the SAP HANA Extended Application Services (SAP HANA XS) API:
     • Alternative to persisting output data to the $TA table
     • Bypasses creating the full-text index
     This option results in lower memory consumption and faster processing.
     Simple scenario that supplies the configuration to the session object and passes the literal text:
         var TA = new $.text.analysis.Session({ configuration: "LINGANALYSIS_FULL" });
         var result = TA.analyze({ inputDocumentText: "New York, New York, that city is a dream." });
  5. New: Grammatical Role Analysis (1/3) – Controlled Beta!
     Optional analyzer that identifies syntactic relationships between elements of a sentence in the form of subject–verb–object expressions, commonly known as ‘triples’.
     Example: [SUBJECT]The big brown cat[/SUBJECT] on the red couch [VERB]was eating[/VERB] a [DIRECTOBJECT]dead mouse[/DIRECTOBJECT].
     The following supported grammatical roles describe arguments of verbs:
     • Subject: person, place, thing, or idea that is doing or being something. Example: Oracle bought Responsys. (Subject: Oracle)
     • DirectObject: recipient of the action. Example: Oracle bought Responsys. (DirectObject: Responsys)
     • IndirectObject: affected by the action but not the primary object. Example: Oracle offered Responsys an improved contract. (IndirectObject: Responsys)
     • OtherObject: often a prepositional object. Example: They talked about the contract. (OtherObject: the contract)
     • Predicate: object of the verb "to be". Example: This is a revised version. (Predicate: a revised version)
     One additional supported grammatical role does not describe a function with respect to a verb:
     • PredicateSubject: subject of a predicative expression. Example: The contract is new. (PredicateSubject: The contract)
  6. New: Grammatical Role Analysis (2/3) – Controlled Beta!
     Input: Oracle was rumored to buy marketing-software maker Responsys Inc. for $1.5 billion.
     Output:
     TA_RULE            TA_COUNTER  TA_TOKEN                                 TA_TYPE                  TA_PARENT  TA_OFFSET
     Entity Extraction  1           Oracle                                   ORGANIZATION/COMMERCIAL  ?          0
     Entity Extraction  2           marketing-software maker                 NOUN_GROUP               ?          26
     Entity Extraction  3           Responsys Inc.                           ORGANIZATION/COMMERCIAL  ?          51
     Entity Extraction  4           $1.5 billion                             CURRENCY                 ?          70
     Grammatical Role   5           Oracle                                   Subject                  7          0
     Grammatical Role   6           Oracle                                   Subject                  8          0
     Grammatical Role   7           rumored                                  Root/MainVerb/Passive    ?          11
     Grammatical Role   8           buy                                      MainVerb/Active          ?          22
     Grammatical Role   9           marketing-software maker Responsys Inc.  DirectObject             8          26
     Grammatical Role   10          $1.5 billion                             OtherObject/for          8          70
     Notes:
     • Core extraction is included in the configuration (rows 1-4)
     • Each grammatical role is either the governor (verb) or a dependent (verb argument)
     • TA_TYPE holds the details about the grammatical role
     • TA_PARENT holds the TA_COUNTER value of the corresponding governor
     • A single dependent can be the argument of two different verbs (rows 5 and 6)
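The TA_PARENT links in the output above can be followed to group each verb with its arguments. The following is a minimal sketch in plain JavaScript, not part of the SAP API: the rows are copied from the slide, the `?` values are represented as `null`, and the helper `groupByGovernor` and the field names are hypothetical.

```javascript
// Grammatical Role rows copied from the slide (TA_PARENT "?" is written as null).
const rows = [
  { counter: 5, token: "Oracle", type: "Subject", parent: 7 },
  { counter: 6, token: "Oracle", type: "Subject", parent: 8 },
  { counter: 7, token: "rumored", type: "Root/MainVerb/Passive", parent: null },
  { counter: 8, token: "buy", type: "MainVerb/Active", parent: null },
  { counter: 9, token: "marketing-software maker Responsys Inc.", type: "DirectObject", parent: 8 },
  { counter: 10, token: "$1.5 billion", type: "OtherObject/for", parent: 8 },
];

// Hypothetical helper: collect every dependent under the token of its governor verb.
function groupByGovernor(taRows) {
  const byCounter = new Map(taRows.map((r) => [r.counter, r]));
  const grouped = new Map();
  for (const r of taRows) {
    if (r.parent === null) continue; // governors (verbs) have no parent
    const verb = byCounter.get(r.parent);
    if (!verb) continue; // ignore references to rows that are not present
    if (!grouped.has(verb.token)) grouped.set(verb.token, []);
    grouped.get(verb.token).push({ role: r.type, token: r.token });
  }
  return grouped;
}

const grouped = groupByGovernor(rows);
for (const [verb, args] of grouped) {
  console.log(verb + ": " + args.map((a) => a.role + "=" + a.token).join(", "));
}
```

In this sample, "Oracle" ends up under both "rumored" and "buy", which matches the note that one dependent can be the argument of two verbs.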
  7. New: Grammatical Role Analysis (3/3) – Controlled Beta!
     The grammatical role analyzer (GRA) ships with Text Analysis in SAP HANA SPS 10 but is hidden and not documented.
     Beta candidates must be familiar with Text Analysis in SAP HANA and prequalified by product management before receiving the required/missing delivery unit and instructions.
     GRA is available via SQL or Text Analysis XS API scenarios and processes text only in English.
  8. New: Metadata Properties Extraction
     The option OutputMetadata in the text analysis configuration file now indicates whether to include document metadata in the output. To include metadata, set the property value to boolean true: <boolean-value>true</boolean-value>
     The following metadata properties are extracted:
     • Author
     • Date
     • Date Created
     • Date Modified
     • Description
     • Keyword
     • Language
     • Subject
     • Title
     • Version
     • FromEmailAddress
     • FromName
     • ToEmailAddress
     • ToName
     • CcEmailAddress
     • CcName
     • BccEmailAddress
     • BccName
  9. Improved: Processing Performance
     Greater throughput for these text data preprocessing steps, achieved through more efficient internal data transfers:
     • Tokenization
     • Stemming
     • Part-of-speech tagging
     • Noun group extraction
     • Core extraction
     Applies to these text analysis configurations:
     • LINGANALYSIS_BASIC, LINGANALYSIS_STEMS, or LINGANALYSIS_FULL
     • EXTRACTION_CORE
     Processing takes 5-25% less time, depending on the language, the text analysis configuration, and the hardware configuration.
  10. Improved: Document Filters
     Document filters in the NLP engine automatically detect and extract text content and metadata from almost any type of binary file format, such as PPT, XLS, and PDF.
     • Performance and viewing fidelity improvements
     • Bug fixes
  11. Improved: Enterprise Extraction for English
     Greater precision and recall for the extraction of entities and facts that are of particular interest to the enterprise domain. Dictionaries were augmented and rules further fine-tuned.
     The following major fact types are classified:
     • Membership Information: information about a person’s affiliations
     • Management Changes: information about management changes
     • Product Releases: information about product releases
     • Mergers & Acquisitions: information about mergers and acquisitions
     • Organizational Information: founder, location, or contact information
  12. Improved: Chinese Module
     Aligned coverage for Simplified and Traditional Chinese.
     The previously separate Simplified and Traditional Chinese modules have been consolidated into a single Chinese module, which simplifies maintenance and reduces the storage footprint. More importantly, a single document that mixes Simplified and Traditional Chinese can now be processed.
  13. Improved: Language Support for Polish
     Full linguistic analysis support for Polish through the addition of part-of-speech (POS) tagging and noun group (concept) extraction.
     Language               LINGANALYSIS_BASIC  LINGANALYSIS_STEMS  LINGANALYSIS_FULL
     Arabic                 ✓                   ✓                   ✓
     Catalan                ✓                   ✓                   ✓
     Chinese (Simplified)   ✓                   ✓                   ✓
     Chinese (Traditional)  ✓                   ✓                   ✓
     Croatian               ✓                   ✓                   ✓
     Czech                  ✓                   ✓                   ✓
     Danish                 ✓                   ✓                   ✓
     Dutch                  ✓                   ✓                   ✓
     English                ✓                   ✓                   ✓
     Farsi                  ✓                   ✓                   ✓
     French                 ✓                   ✓                   ✓
     German                 ✓                   ✓                   ✓
     Greek                  ✓
     Hebrew                 ✓                   ✓                   ✓
     Hungarian              ✓
     Indonesian             ✓                   ✓                   ✓
     Italian                ✓                   ✓                   ✓
     Japanese               ✓                   ✓                   ✓
     Korean                 ✓                   ✓                   ✓
     Norwegian (Bokmal)     ✓                   ✓                   ✓
     Norwegian (Nynorsk)    ✓                   ✓                   ✓
     Polish                 ✓                   ✓                   ✓ NEW
     Portuguese             ✓                   ✓                   ✓
     Romanian               ✓
     Russian                ✓                   ✓                   ✓
     Serbian                ✓                   ✓                   ✓
     Slovak                 ✓                   ✓                   ✓
     Slovenian              ✓                   ✓                   ✓
     Spanish                ✓                   ✓                   ✓
     Swedish                ✓                   ✓                   ✓
     Thai                   ✓                   ✓                   ✓
     Turkish                ✓                   ✓
  14. Improved: Part-of-Speech Mappings in East Asian Languages
     Previously, part-of-speech (POS) tag names in several East Asian languages were mapped to ‘unknown’. Now the TA_TYPE output reflects what a native speaker would expect.
     POS tag (Language Reference Guide)  TA_TYPE ($TA table)  Language(s)
     Cl                                  Noun                 Chinese/Thai classifier
     Suf                                 Noun                 Japanese suffix
  15. Improved: Predefined Core Extraction: PERSON & PRODUCT
     Augmented modules in all core extraction languages (except Arabic, Farsi, Korean, and Russian). Entity types with examples:
     • NAME_DESIGNATOR: c/o, attn
     • TITLE: President
     • PERSON: Barack Obama (updated)
     • PEOPLE: Greeks
     • LANGUAGE: Greek
     • ADDRESS1: 245 First Street Floor 16
     • ADDRESS2: Cambridge, MA 02142
     • LOCALITY: Cambridge
     • REGION@MINOR: Napa County
     • REGION@MAJOR: Connecticut
     • COUNTRY: Brazil
     • CONTINENT: South America
     • GEO_FEATURE: Mount Fuji
     • GEO_AREA: Scandinavia
     • ORGANIZATION@COMMERCIAL: AT&T
     • ORGANIZATION@EDUCATIONAL: University of Washington
     • ORGANIZATION@OTHER: FBI
     • PRODUCT: iPhone (updated)
     • TICKER: NYSE:SAP
     • SOCIAL_MEDIA@TWITTER_ID: @SAP
     • SOCIAL_MEDIA@TWITTER_TOPIC: #HANA
     • DATE: 2/14/2011
     • DAY: Monday
     • MONTH: June
     • YEAR: 2011
     • TIME: 3:47pm
     • TIME_PERIOD: 3 days, from 9 to 5pm
     • HOLIDAY: Memorial Day
     • CURRENCY: 17 euros
     • MEASURE: 217 meters
     • PERCENT: 4%
     • PHONE: 617-677-2030
     • NIN@US_SSN: 522-89-2255
     • NIN@FR_INSEE: xxx
     • NIN@CA_SIN: xxx
     • URI@EMAIL: john.smith@sap.com
     • URI@IP: 165.14.2.0
     • URI@URL: http://sap.com
     Syntactic entities:
     • NOUN_GROUP: big umbrella
     • PROP_MISC: Cup o’ Soup
  16. How to find SAP HANA documentation on this topic?
     SAP HANA Advanced Data Processing
     • What’s New in the SAP HANA Advanced Data Processing (Release Notes)
     • Development
       – File Loader Guide for SAP HANA
       – SAP HANA Search Developer Guide
       – SAP HANA Text Analysis Developer Guide
     • References
       – SAP HANA Text Analysis Extraction Customization Guide
       – SAP HANA Text Analysis Language Reference Guide
       – SAP HANA Text Analysis XS JavaScript API
     In addition to this learning material, you can find SAP HANA documentation on the SAP Help Portal knowledge center at http://help.sap.com/hana_options_adp.
     The knowledge center is structured according to the product lifecycle: installation, security, administration, development.
  17. Text Mining
  18. Agenda – Text Mining: New or Improved Features
     • Text Mining SQL extensions
     • Similarity measures – Jaccard, Dice, Overlap
     • Additional configurations
     • Text Mining XS API – SQLCC
  19. New: Text Mining SQL Extensions (1/2)
     In addition to the SAP HANA XS API, text mining is now exposed via the following SQL functions:
     • TM_GET_RELATED_TERMS: returns top-ranked terms related to a term
     • TM_GET_RELEVANT_TERMS: returns top-ranked relevant terms (key phrases) that describe a document
     • TM_GET_SUGGESTED_TERMS: returns top-ranked terms matching an initial substring
     • TM_GET_RELATED_DOCUMENTS: returns top-ranked documents related to a document
     • TM_GET_RELEVANT_DOCUMENTS: returns top-ranked documents relevant to a term
     • TM_CATEGORIZE_KNN: enables categorization
  20. New: Text Mining SQL Extensions (2/2)
     The following example returns terms starting with ‘john’ from a news table:
         SELECT T.RANK, T.TERM, T.DOCUMENT_FREQUENCY, T.SCORE
         FROM TM_GET_SUGGESTED_TERMS (
             TERM 'john'
             SEARCH DISTINCT "content" FROM "myschema"."news"
             RETURN TOP 5
         ) AS T
         WHERE T.SCORE > 0.1 AND T.TERM_FREQUENCY > 5
     RANK  TERM              NORMALIZED_TERM   TERM_TYPE     TERM_FREQUENCY  DOCUMENT_FREQUENCY  SCORE
     1     John              john              Proper        25              11                  0.86
     2     John Doe          john doe          Proper        12              8                   0.24
     3     John Miller       john miller       Proper        5               3                   0.21
     4     Johnny            johnny            Proper        7               7                   0.15
     5     Johnson and Sons  johnson and sons  Organization  5               2                   0.09
     Note that both single-word and multi-word terms are returned.
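The table on this slide lists five rows and more columns than the SELECT list names, so it presumably shows the unfiltered output of TM_GET_SUGGESTED_TERMS; that reading is an assumption. The following plain JavaScript sketch, which does not call SAP HANA, applies the WHERE clause and the SELECT list to the rows copied from the slide to show what the full statement would keep. The camelCase field names are invented for the illustration.

```javascript
// Rows copied from the result table on the slide.
const suggested = [
  { rank: 1, term: "John", termFrequency: 25, documentFrequency: 11, score: 0.86 },
  { rank: 2, term: "John Doe", termFrequency: 12, documentFrequency: 8, score: 0.24 },
  { rank: 3, term: "John Miller", termFrequency: 5, documentFrequency: 3, score: 0.21 },
  { rank: 4, term: "Johnny", termFrequency: 7, documentFrequency: 7, score: 0.15 },
  { rank: 5, term: "Johnson and Sons", termFrequency: 5, documentFrequency: 2, score: 0.09 },
];

// Mirrors: WHERE T.SCORE > 0.1 AND T.TERM_FREQUENCY > 5
const kept = suggested.filter((r) => r.score > 0.1 && r.termFrequency > 5);

// Mirrors: SELECT T.RANK, T.TERM, T.DOCUMENT_FREQUENCY, T.SCORE
const projected = kept.map((r) => ({
  rank: r.rank,
  term: r.term,
  documentFrequency: r.documentFrequency,
  score: r.score,
}));

console.log(projected.map((r) => r.term)); // [ 'John', 'John Doe', 'Johnny' ]
```

"John Miller" is dropped because its term frequency of 5 is not greater than 5, and "Johnson and Sons" fails both conditions.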
  21. New: Similarity Measures – Jaccard, Dice (Sørensen), Overlap
     The similarity of two documents is related to how much their vectors point in the same direction. Text mining now provides several standard similarity measures.
     Previously, only cosine was offered: <property name="similarityFunction">COSINE</property>
     Three additional similarity measures are supported, which differ in the treatment of matching vs. differing term weights: JACCARD, DICE, OVERLAP
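The four measures can be compared on a toy example. This is a minimal sketch using the common textbook definitions: cosine on weighted term vectors, and the set-based forms of Jaccard, Dice, and Overlap that only look at which terms are present. The slides do not state the exact weighted formulas that SAP HANA text mining uses, so the functions and sample documents below are illustrative assumptions, not the product's implementation.

```javascript
// Documents as term -> weight maps (sample data invented for this sketch).
const doc1 = new Map([["hana", 1], ["text", 1], ["mining", 1]]);
const doc2 = new Map([["hana", 1], ["text", 1], ["analysis", 1], ["sql", 1]]);

// Cosine: dot product divided by the product of the vector lengths.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (const [term, w] of a) {
    normA += w * w;
    if (b.has(term)) dot += w * b.get(term);
  }
  for (const w of b.values()) normB += w * w;
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Set-based measures: only the presence of a term matters, not its weight.
function setMeasures(a, b) {
  let shared = 0;
  for (const term of a.keys()) if (b.has(term)) shared++;
  const union = a.size + b.size - shared;
  const smaller = Math.min(a.size, b.size);
  return {
    jaccard: union === 0 ? 0 : shared / union,            // |A∩B| / |A∪B|
    dice: a.size + b.size === 0 ? 0 : (2 * shared) / (a.size + b.size), // 2|A∩B| / (|A|+|B|)
    overlap: smaller === 0 ? 0 : shared / smaller,        // |A∩B| / min(|A|,|B|)
  };
}

const measures = setMeasures(doc1, doc2);
console.log("cosine", cosine(doc1, doc2));
console.log(measures);
```

With two shared terms out of five distinct ones, Jaccard gives 0.4, while Overlap gives 2/3 because it normalizes by the smaller document only.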
  22. New: Additional Configurations Support
     The configuration properties specified at initialization time serve later as defaults for unspecified parameters when text mining functions are invoked on the given reference data.
     The configuration is user-configurable and hosted in the SAP HANA repository. The name of the default configuration file in the repository is DEFAULT.textminingconfig.
     You can now create additional custom configuration files, which can be specified by name when text mining is initialized.
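The idea of configuration values acting as defaults for unspecified parameters can be sketched in plain JavaScript. This is only an illustration of the fallback behavior, not the text mining API: `resolveParameters` is a hypothetical helper, `similarityFunction` is the property name shown in the slides, and `top` is an invented example parameter.

```javascript
// Hypothetical stand-in for properties loaded from a configuration file.
// "similarityFunction" appears in the slides; "top" is invented for this sketch.
const initDefaults = { similarityFunction: "COSINE", top: 10 };

// Hypothetical helper: parameters given at call time win, everything else
// falls back to the value fixed at initialization time.
function resolveParameters(defaults, callParameters) {
  const resolved = { ...defaults };
  for (const [name, value] of Object.entries(callParameters)) {
    if (value !== undefined) resolved[name] = value;
  }
  return resolved;
}

const withDefaults = resolveParameters(initDefaults, {});
const withOverride = resolveParameters(initDefaults, { similarityFunction: "JACCARD" });
console.log(withDefaults, withOverride);
```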
  23. Improved: Text Mining XS API – SQLCC
     Text mining applications need the proper credentials to access the reference data and other tables. By default, text mining uses the credentials of the user running the application.
     For security reasons, we recommend granting users access only to the application and not directly to the database tables. To support this and to improve manageability, text mining now allows you to specify, via SQLCC, an alternate anonymous/technical user whose access rights apply during a session.
     Deploy the application along with an .xssqlcc file that specifies a technical user with the desired access rights. The application needs to open a connection that uses the .xssqlcc file and specify that connection when creating the text mining session.
  24. How to find SAP HANA documentation on this topic?
     SAP HANA Advanced Data Processing
     • What’s New in the SAP HANA Advanced Data Processing (Release Notes)
     • Development
       – File Loader Guide for SAP HANA
       – SAP HANA Text Mining Developer Guide
     • References
       – SAP HANA Text Mining XS JavaScript API
       – SQL Reference for Options
     In addition to this learning material, you can find SAP HANA documentation on the SAP Help Portal knowledge center at http://help.sap.com/hana_options_adp.
     The knowledge center is structured according to the product lifecycle: installation, security, administration, development.
  25. Disclaimer
     This presentation outlines our general product direction and should not be relied on in making a purchase decision. This presentation is not subject to your license agreement or any other agreement with SAP. SAP has no obligation to pursue any course of business outlined in this presentation or to develop or release any functionality mentioned in this presentation. This presentation and SAP’s strategy and possible future developments are subject to change and may be changed by SAP at any time for any reason without notice. This document is provided without a warranty of any kind, either express or implied, including but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. SAP assumes no responsibility for errors or omissions in this document, except if such damages were caused by SAP intentionally or grossly negligent.
  26. Thank you
     Contact information: Anthony Waite, SAP HANA Product Management, AskSAPHANA@sap.com
