Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Scalable Natural Language Processing (SNaP)
IBM Research | Almaden
SystemT:
Declarative Information Extraction
Yunyao Li
Text analytics
is the key for
Unlocking big data value
Customer 360º View
Social Media Analytics, CRM Analytics
Security &...
Text Analytics
Text Analytics vs Information Extraction
3
Information
Extraction
Information
Extraction
ClusteringClusteri...
Information Extraction All Over
… hotel close to Seaworld in Orlando?
Planning to go this coming weekend…
Social Media
Cus...
5
Extraction Example: Fact Extraction (Tables)
Singapore 2012 Annual Report
(136 pages PDF)
Identify note breaking down
Op...
6
Extraction Example: Sentiment Analysis
Mcdonalds mcnuggets are fake as shit but they so delicious.Mcdonalds mcnuggets ar...
Enterprise Requirements
• Accuracy
– Garbage-in garbage-out: Usefulness of analytics results is tied to the quality of
ext...
8
Core OperationsCore OperationsExecution EngineExecution Engine
SystemT’s 5 Layered Architecture
Compiler + OptimizerComp...
9
package com.ibm.avatar.algebra.util.sentence;
import java.io.BufferedWriter;
import java.util.ArrayList;
import java.uti...
10
State-of-the-art
industrial system
State-of-the-art
open-source system
SystemT
Scalability via Optimization
SystemT nee...
Products utilizing SystemT for pre-determined extraction tasksProducts utilizing SystemT for pre-determined extraction tas...
Case Study: Drug Discovery
0-I II III IV
Time to develop a drug; 12 -15 years
Avg cost to develop a drug: 1.2 billion
Adapted from PhRMA (Pharmaceuti...
LaboratoryLaboratory Pre-clinicalPre-clinical ClinicalClinical FDA ApprovalFDA Approval BedsideBedside
Objective: enable d...
Structure
Indication
Indication
Mode of action
/ target
Effect level
Side Effect
360° view on drugs allows for comparison ...
Patents also have (Manually Created)
Chemical Complex Work Units (CWU’s)
As text
Chemical names
found in the text of
docum...
Structure
Renders content searchable and avoid redundant and expensive tests
Building block for automation of the discov...
EfficacyEfficacyEfficacyEfficacyResults
StudyStudy
Authorship InterventionInterventionAuthorship InterventionAuthorship In...
Preliminary Results
SUGGEST NEW P53 KINASES
2. Cluster proteins by their literature distance: known p53
kinases cluster (g...
Summary
• SystemT
– Declarative IE system based on an algebraic framework
– Expressivity and performance advantages over g...
Thank you!
• For more information…
– Visit our website: https://ibm.biz/BdF4GQ
– Get a free copy:
• IBM BigInsights Quick ...
Upcoming SlideShare
Loading in …5
×

SystemT: Declarative Information Extraction

1,121 views

Published on

Slides used for my talk "SystemT: Declarative Information Extraction" at the event "University of Oregon Big Opportunities with Big Data Meeting" on August 8, 2014 (http://bigdata.uoregon.edu).

Published in: Technology
  • Be the first to comment

SystemT: Declarative Information Extraction

  1. 1. Scalable Natural Language Processing (SNaP) IBM Research | Almaden SystemT: Declarative Information Extraction Yunyao Li
  2. 2. Text analytics is the key for Unlocking big data value Customer 360º View Social Media Analytics, CRM Analytics Security & Privacy Email Analytics, Data Redaction, Digital Piracy Operations Analysis Machine Data Analytics Financial Analytics Systemic risk analysis, Event monitoring Healthcare Analytics Patient care, drug discovery and many more …
  3. 3. Text Analytics Text Analytics vs Information Extraction 3 Information Extraction Information Extraction ClusteringClustering ClassificationClassification RegressionRegression Core operations such as pos tagging, deep parsing, sentence detection etc. Core operations such as pos tagging, deep parsing, sentence detection etc. Information extraction: A key enabling technology for text analytics •Input: Raw text and features from low-level natural language processing operations •Output: Structured data suitable for general purpose analysis software
  4. 4. Information Extraction All Over … hotel close to Seaworld in Orlando? Planning to go this coming weekend… Social Media Customer 360º Security & Privacy Operations Analysis Financial Analytics my social security # is 400-05-4356 Email Machine Data Financial Statements Medical Records News CRM Patents Product:Hotel Location:Orlando Type:SSN Value:400054356 Storage Module 2 Drive 1 fault Event:DriveFail Loc:mod2.Slot1
  5. 5. 5 Extraction Example: Fact Extraction (Tables) Singapore 2012 Annual Report (136 pages PDF) Identify note breaking down Operating expenses line item, and extract opex components Identify note breaking down Operating expenses line item, and extract opex components Identify line item for Operating expenses from Income statement (financial table in pdf document) Identify line item for Operating expenses from Income statement (financial table in pdf document)
  6. 6. 6 Extraction Example: Sentiment Analysis Mcdonalds mcnuggets are fake as shit but they so delicious.Mcdonalds mcnuggets are fake as shit but they so delicious. We should do something cool like go to Z (kidding).We should do something cool like go to Z (kidding). Makin chicken fries at home bc everyone sucks!Makin chicken fries at home bc everyone sucks! You are never too old for Disney movies.You are never too old for Disney movies. Social Media Different domain  Different ball game! Bank X got me ****ed up today!Bank X got me ****ed up today! Customer Reviews Not a pleasant client experience. Please fix ASAP.Not a pleasant client experience. Please fix ASAP. I'm still hearing from clients that Merrill's website is better.I'm still hearing from clients that Merrill's website is better. X... fixing something that wasn't brokenX... fixing something that wasn't broken Analyst Research Reports Intel's 2013 capex is elevated at 23% of sales, above average of 16%Intel's 2013 capex is elevated at 23% of sales, above average of 16% IBM announced 4Q2012 earnings of $5.13 per share, compared with 4Q2011 earnings of $4.62 per share, an increase of 11 percent IBM announced 4Q2012 earnings of $5.13 per share, compared with 4Q2011 earnings of $4.62 per share, an increase of 11 percent We continue to rate shares of MSFT neutral.We continue to rate shares of MSFT neutral. Sell EUR/CHF at market for a decline to 1.31000…Sell EUR/CHF at market for a decline to 1.31000… FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury. Equity Fixed Income Currency
  7. 7. Enterprise Requirements • Accuracy – Garbage-in garbage-out: Usefulness of analytics results is tied to the quality of extraction • Scalability – Large data volumes, often orders of magnitude larger than classical NLP corpora • Social Media: – Twitter alone has 400M+ messages / day; 1TB+ per day • Financial Data: – SEC alone has 20M+ filings, several TBs of data, with documents range from few KBs to few MBs • Machine Data: – One application server under moderate load at medium logging level  1GB of logs per day • Transparency – Every customer’s data and problems are unique in some way – Need to quickly implement new business logic • Usability – Building an accurate IE system is labor-intensive – Professional programmers are expensive
  8. 8. 8 Core OperationsCore OperationsExecution EngineExecution Engine SystemT’s 5 Layered Architecture Compiler + OptimizerCompiler + Optimizer Extraction LanguageExtraction Language Generic LibrariesGeneric Libraries Tooling + UITooling + UI 20+ languages currently supported NE libraries, Sentiment, informal-text normalizer, … Scalable Embeddable Runtime AQL Cost based optimizer Eclipse tooling: editor, workflow analysis, pattern discovery, regex learner, profiler,…
  9. 9. 9 package com.ibm.avatar.algebra.util.sentence; import java.io.BufferedWriter; import java.util.ArrayList; import java.util.HashSet; import java.util.regex.Matcher; public class SentenceChunker { private Matcher sentenceEndingMatcher = null; public static BufferedWriter sentenceBufferedWriter = null; private HashSet<String> abbreviations = new HashSet<String> (); public SentenceChunker () { } /** Constructor that takes in the abbreviations directly. */ public SentenceChunker (String[] abbreviations) { // Generate the abbreviations directly. for (String abbr : abbreviations) { this.abbreviations.add (abbr); } } /** * @param doc the document text to be analyzed * @return true if the document contains at least one sentence boundary */ public boolean containsSentenceBoundary (String doc) { String origDoc = doc; /* * Based on getSentenceOffsetArrayList() */ // String origDoc = doc; // int dotpos, quepos, exclpos, newlinepos; int boundary; int currentOffset = 0; do { /* Get the next tentative boundary for the sentenceString */ setDocumentForObtainingBoundaries (doc); boundary = getNextCandidateBoundary (); if (boundary != -1) {doc.substring (0, boundary + 1); String remainder = doc.substring (boundary + 1); String candidate = /* * Looks at the last character of the String. If this last * character is part of an abbreviation (as detected by * REGEX) then the sentenceString is not a fullSentence and * "false” is returned */ // while (!(isFullSentence(candidate) && // doesNotBeginWithCaps(remainder))) { while (!(doesNotBeginWithPunctuation (remainder) && isFullSentence (candidate))) { /* Get the next tentative boundary for the sentenceString */ int nextBoundary = getNextCandidateBoundary (); if (nextBoundary == -1) { break; } boundary = nextBoundary; candidate = doc.substring (0, boundary + 1); remainder = doc.substring (boundary + 1); } if (candidate.length () > 0) { // sentences.addElement(candidate.trim().replaceAll("n", " // ")); // sentenceArrayList.add(new Integer(currentOffset + boundary // + 1)); // currentOffset += boundary + 1; // Found a sentence boundary. If the boundary is the last // character in the string, we don't consider it to be // contained within the string. int baseOffset = currentOffset + boundary + 1; if (baseOffset < origDoc.length ()) { // System.err.printf("Sentence ends at %d of %dn", // baseOffset, origDoc.length()); return true; } else { return false; } } // origDoc.substring(0,currentOffset)); // doc = doc.substring(boundary + 1); doc = remainder; } } while (boundary != -1); // If we get here, didn't find any boundaries. return false; } public ArrayList<Integer> getSentenceOffsetArrayList (String doc) { ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> (); // String origDoc = doc; // int dotpos, quepos, exclpos, newlinepos; int boundary; int currentOffset = 0; sentenceArrayList.add (new Integer (0)); do { /* Get the next tentative boundary for the sentenceString */ setDocumentForObtainingBoundaries (doc); boundary = getNextCandidateBoundary (); if (boundary != -1) { String candidate = doc.substring (0, boundary + 1); String remainder = doc.substring (boundary + 1); /* * Looks at the last character of the String. If this last character * is part of an abbreviation (as detected by REGEX) then the * sentenceString is not a fullSentence and "false" is returned */ // while (!(isFullSentence(candidate) && // doesNotBeginWithCaps(remainder))) { while (!(doesNotBeginWithPunctuation (remainder) && isFullSentence (candidate))) { /* Get the next tentative boundary for the sentenceString */ int nextBoundary = getNextCandidateBoundary (); if (nextBoundary == -1) { break; } boundary = nextBoundary; candidate = doc.substring (0, boundary + 1); remainder = doc.substring (boundary + 1); } if (candidate.length () > 0) { sentenceArrayList.add (new Integer (currentOffset + boundary + 1)); currentOffset += boundary + 1; } // origDoc.substring(0,currentOffset)); doc = remainder; } } while (boundary != -1); if (doc.length () > 0) { sentenceArrayList.add (new Integer (currentOffset + doc.length ())); } sentenceArrayList.trimToSize (); return sentenceArrayList; } private void setDocumentForObtainingBoundaries (String doc) { sentenceEndingMatcher = SentenceConstants. sentenceEndingPattern.matcher (doc); } private int getNextCandidateBoundary () { if (sentenceEndingMatcher.find ()) { return sentenceEndingMatcher.start (); } else return -1; } private boolean doesNotBeginWithPunctuation (String remainder) { Matcher m = SentenceConstants.punctuationPattern.matcher (remainder); return (!m.find ()); } private String getLastWord (String cand) { Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand); if (lastWordMatcher.find ()) { return lastWordMatcher.group (); } else { return ""; } } /* * Looks at the last character of the String. If this last character is * par of an abbreviation (as detected by REGEX) * then the sentenceString is not a fullSentence and "false" is returned */ private boolean isFullSentence (String cand) { // cand = cand.replaceAll("n", " "); cand = " " + cand; Matcher validSentenceBoundaryMatcher = SentenceConstants.validSentenceBoundaryPattern.matcher (cand); if (validSentenceBoundaryMatcher.find ()) return true; Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand); if (abbrevMatcher.find ()) { return false; // Means it ends with an abbreviation } else { // Check if the last word of the sentenceString has an entry in the // abbreviations dictionary (like Mr etc.) String lastword = getLastWord (cand); if (abbreviations.contains (lastword)) { return false; } } return true; } } Java Implementation of Sentence Boundary Detection Easy to Develop & Maintain create dictionary AbbrevDict from file 'abbreviation.dict’; create view SentenceBoundary as select R.match as boundary from ( extract regex /(([.?!]+s)|(ns*n))/ on D.text as match from Document D ) R where Not(ContainsDict('AbbrevDict', CombineSpans(LeftContextTok(R.match, 1),R.match))); create dictionary AbbrevDict from file 'abbreviation.dict’; create view SentenceBoundary as select R.match as boundary from ( extract regex /(([.?!]+s)|(ns*n))/ on D.text as match from Document D ) R where Not(ContainsDict('AbbrevDict', CombineSpans(LeftContextTok(R.match, 1),R.match))); Equivalent AQL Implementation
  10. 10. 10 State-of-the-art industrial system State-of-the-art open-source system SystemT Scalability via Optimization SystemT needs less than 10% of the cores to keep up with Twitter’s daily feed without drop in quality 10x50x
  11. 11. Products utilizing SystemT for pre-determined extraction tasksProducts utilizing SystemT for pre-determined extraction tasksProducts utilizing SystemT for pre-determined extraction tasksProducts utilizing SystemT for pre-determined extraction tasks Products fully exposingProducts fully exposing the AQL language and enginethe AQL language and engine Products fully exposingProducts fully exposing the AQL language and enginethe AQL language and engine SystemT in IBM Products InfoSphere BigInsights InfoSphere Streams Social Media Analytics eDiscovery Analyzer Optim Data Lifecycle Management Solutions SmartCloud Analytics: Log Analysis Watson Discovery Advisor Installed on 400+ client sites
  12. 12. Case Study: Drug Discovery
  13. 13. 0-I II III IV Time to develop a drug; 12 -15 years Avg cost to develop a drug: 1.2 billion Adapted from PhRMA (Pharmaceutical Research and Manufacturers of America) 2013 profile Laboratory 50,000 + Compounds Pre-Clinical 250 Compounds Clinical 5 Compounds FDA approval 1 compound “..Toxicity and Serious Adverse Events in Late Stage Drug Development are the Major Causes of Drug Failure” Intervention should happen at critical transition bottlenecks between stages (most likely to impact outcome) Drug Development Pipeline
  14. 14. LaboratoryLaboratory Pre-clinicalPre-clinical ClinicalClinical FDA ApprovalFDA Approval BedsideBedside Objective: enable data-driven decision making more efficient clinical trial design, data analytics and drug success / failure predictions Case Study: Drug Discovery
  15. 15. Structure Indication Indication Mode of action / target Effect level Side Effect 360° view on drugs allows for comparison between drugs, and therefore allow to predict drug behavior …Structured and unstructured data sources Extract known facts about a drug
  16. 16. Patents also have (Manually Created) Chemical Complex Work Units (CWU’s) As text Chemical names found in the text of documents As bitmap images Pictures of chemicals found in the document Images Unstructured data contain chemical compound information in many different forms Chemical nomenclature can be daunting Information Extraction Example
  17. 17. Structure Renders content searchable and avoid redundant and expensive tests Building block for automation of the discovery and decision process Internal reports Public Medline articles with limited structured info POPULATION gender FEMALE,MALE species RAT strain Sprague Dawley precondition pregnant From unstructured to structured content  Handle variants in entity reference  Disambiguate based on context, etc…  Extract relationships  Assemble complex concepts Combine rule-based extraction with Machine Learning  Handle variants in entity reference  Disambiguate based on context, etc…  Extract relationships  Assemble complex concepts Combine rule-based extraction with Machine Learning
  18. 18. EfficacyEfficacyEfficacyEfficacyResults StudyStudy Authorship InterventionInterventionAuthorship InterventionAuthorship Intervention PopulationPopulationPopulationPopulationPopulation Study name Size Population description Time interval Intervention Phase Type Authorship Author Author affiliation Funding org Compound Target Admin route dosage Type (preventive /therapeutic) Species Subspecies Gender Age Pre-existing conditions Physical attributes Geographic Location 1.Extractmentions2.Assemblementions intoinformation In-life Observations Hematology Clinical chem Comparative Macroscopic Results Comparative Microscopic Resutls
  19. 19. Preliminary Results SUGGEST NEW P53 KINASES 2. Cluster proteins by their literature distance: known p53 kinases cluster (green) and suggest others that may also phosphorylate p53. GREEN – p53 Kinases RED/ORANGE/YELLOW – Predicted Targets 1. Quantify the “literature distances” between proteins BMPR2CHEK2 Bag of Words Model Example Use Case: Drug Development
  20. 20. Summary • SystemT – Declarative IE system based on an algebraic framework – Expressivity and performance advantages over grammar- based IE systems – Text-specific optimizations – Key enabling technology for Big Data analytics applications • Ongoing Work – Tooling support for non-programmers – Improved optimization strategies – New operators for advanced features (e.g. co-reference resolution, text normalization)
  21. 21. Thank you! • For more information… – Visit our website: https://ibm.biz/BdF4GQ – Get a free copy: • IBM BigInsights Quick start Edition http://www- 01.ibm.com/software/data/infosphere/biginsights/quick-start/ • IBM Academic Initiative: http://www- 03.ibm.com/ibm/university/academic/pub/page/academic_initiative. InfoSphere BigInsights 2.1.1 (and Streams 3.0) are freely available for academic purposes. – Take free online classes: BigDataUniversity.com – Watch YouTube videos: • IBM InfoSphere BigInsights Text Analytics Channel: http://www.youtube.com/playlist?list=PL2C66C6E455AD0216 • IBM Big Data Channel: http://www.youtube.com/user/ibmbigdata • IBM InfoSphere Streams Channel: http://www.youtube.com/playlist? list=PLCF04A48C22F34B19 – Contact me • yunyaoli@us.ibm.com SystemT Team: Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy, Yunyao Li Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu SystemT Team: Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy, Yunyao Li Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu

×