Declarative Multilingual Information Extraction with SystemT

125 views

Published on

"Declarative Multilingual Information Extraction with SystemT" presented by Laura Chiticariu, IBM Research - Almaden as part of the Cognitive Systems Institute Speaker Series.

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
125
On SlideShare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
4
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Declarative Multilingual Information Extraction with SystemT

  1. 1. SystemT: Declarative Information Extraction Laura Chiticariu, IBM Research Cognitive Systems Institute Speaker Series 11/17/2016
  2. 2. • Use cases • Requirements • SystemT • SystemT courses Outline
  3. 3. Case Study 1: Sentiment Analysis Text Analytics Product catalog, Customer Master Data, … Social Media • Products Interests360o Profile • Relationships • Personal Attributes • Life Events Statistical Analysis, Report Gen. Ban k6 23 % Ban k5 20 % Ban k4 21 % Ban k3 5% Ban k2 15 % Ban k1 15 % Scale 500M+ tweets a day, 100M+ consumers, … Breadth Buzz, intent, sentiment, life events, personal atts, … Complexity Sarcasm, wishful thinking, …
  4. 4. Case Study 2: Machine Analysis Web Server Logs Custom Application Logs Database Logs Raw Logs Parsed Log Records Parse Extract Linkage Information End-to-End Application Sessions Integrate Graphical Models (Anomaly detection) Analyze 5 min 0.85 Cause Effect Delay Correlation 10 min 0.62 Correlation Analysis • Every customer has unique components with unique log record formats • Need to quickly customize all stages of analysis for these custom log data sources
  5. 5. • Use cases • Requirements • SystemT • SystemT courses Outline
  6. 6. Expressivity • Need to express complex algorithms for a variety of tasks and data sources Scalability • Large number of documents and large-size documents Transparency • Every customer’s data and problems are unique in some way • Need to easily comprehend, debug and enhance extractors Enterprise Requirements
  7. 7. Expressivity Example: Different Kinds of Parses We are raising our tablet forecast. S are NP We S raising NP forecastNP tablet DET our subj obj subj pred Natural Language Machine Log Dependency Tree Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected Time Oct 1 04:12:24 Host 9.1.1.3 Process 41865 Category %PLATFORM_ENV-1-DUAL_PWR Message Faulty internal power supply B detected
  8. 8. Expressivity Example: Fact Extraction (Tables) 8 Singapore 2012 Annual Report (136 pages PDF) Identify note breaking down Operating expenses line item, and extract opex components Identify line item for Operating expenses from Income statement (financial table in pdf document)
  9. 9. • Social Media: Twitter alone has 500M+ messages / day; 1TB+ per day • Financial Data: SEC alone has 20M+ filings, several TBs of data, with documents range from few KBs to few MBs • Machine Data: One application server under moderate load at medium logging level à 1GB of logs per day Scalability Example
  10. 10. Transparency Example: Need to incorporate in-situ data Intel's 2013 capex is elevated at 23% of sales, above average of 16% FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury. I'm still hearing from clients that Merrill's website is better. Customer or competitor? Good or bad? Entity of interest I need to go back to Walmart, Toys R Us has the same toy $10 cheaper!
  11. 11. • Use cases • Requirements • SystemT • SystemT courses Outline
  12. 12. AQL Language Optimizer Operator Runtime Specify extractor semantics declaratively Choose efficient execution plan that implements semantics Example AQL Extractor Fundamental Results & Theorems • Expressivity: The class of extraction tasks expressible in AQL is a strict superset of that expressible through cascaded regular automata. • Performance: For any acyclic token-based finite state transducer T, there exists an operator graph G such that evaluating T and G has the same computational complexity. create view PersonPhone as select P.name as person, N.number as phone from Person P, PhoneNum N, Sentence S where Follows(P.name, N.number, 0, 30) and Contains(S.sentence, P.name) and Contains(S.sentence, N.number) and ContainsRegex( /b(phone|at)b/ , SpanBetween(P.name, N.number) ); Within a singlesentence <Person> <PhoneNum> 0-30 chars Contains“phone” or “at” Within a singlesentence <Person> <PhoneNum> 0-30 chars Contains“phone” or “at” More than 35 papers across ACL, EMNLP, VLDB, PODS and SIGMOD SystemT: IBM Research project started in 2006 SystemT Architecture
  13. 13. package com.ibm.avatar.algebra.util.sentence; import java.io.BufferedWriter; import java.util.ArrayList; import java.util.HashSet; import java.util.regex.Matcher; public class SentenceChunker { private Matcher sentenceEndingMatcher = null; public static BufferedWriter sentenceBufferedWriter = null; private HashSet<String> abbreviations = new HashSet<String> (); public SentenceChunker () { } /** Constructor that takes in the abbreviations directly. */ public SentenceChunker (String[] abbreviations) { // Generate the abbreviations directly. for (String abbr : abbreviations) { this.abbreviations.add (abbr); } } /** * @param doc the document text to be analyzed * @return true if the document contains at least one sentence boundary */ public boolean containsSentenceBoundary (String doc) { String origDoc = doc; /* * Based on getSentenceOffsetArrayList() */ // String origDoc = doc; // int dotpos, quepos, exclpos, newlinepos; int boundary; int currentOffset = 0; do { /* Get the next tentative boundary for the sentenceString */ setDocumentForObtainingBoundaries (doc); boundary = getNextCandidateBoundary (); if (boundary != -1) {doc.substring (0, boundary + 1); String remainder = doc.substring (boundary + 1); String candidate = /* * Looks at the last character of the String. If this last * character is part of an abbreviation (as detected by * REGEX) then the sentenceString is not a fullSentence and * "false” is returned */ // while (!(isFullSentence(candidate) && // doesNotBeginWithCaps(remainder))) { while (!(doesNotBeginWithPunctuation (remainder) && isFullSentence (candidate))) { /* Get the next tentative boundary for the sentenceString */ int nextBoundary = getNextCandidateBoundary (); if (nextBoundary == -1) { break; } boundary = nextBoundary; candidate = doc.substring (0, boundary + 1); remainder = doc.substring (boundary + 1); } if (candidate.length () > 0) { // sentences.addElement(candidate.trim().replaceAll("n", " // ")); // sentenceArrayList.add(new Integer(currentOffset + boundary // + 1)); // currentOffset += boundary + 1; // Found a sentence boundary. If the boundary is the last // character in the string, we don't consider it to be // contained within the string. int baseOffset = currentOffset + boundary + 1; if (baseOffset < origDoc.length ()) { // System.err.printf("Sentence ends at %d of %dn", // baseOffset, origDoc.length()); return true; } else { return false; } } // origDoc.substring(0,currentOffset)); // doc = doc.substring(boundary + 1); doc = remainder; } } while (boundary != -1); // If we get here, didn't find any boundaries. return false; } public ArrayList<Integer> getSentenceOffsetArrayList (String doc) { ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> (); // String origDoc = doc; // int dotpos, quepos, exclpos, newlinepos; int boundary; int currentOffset = 0; sentenceArrayList.add (new Integer (0)); do { /* Get the next tentative boundary for the sentenceString */ setDocumentForObtainingBoundaries (doc); boundary = getNextCandidateBoundary (); if (boundary != -1) { String candidate = doc.substring (0, boundary + 1); String remainder = doc.substring (boundary + 1); /* * Looks at the last character of the String. If this last character * is part of an abbreviation (as detected by REGEX) then the * sentenceString is not a fullSentence and "false" is returned */ // while (!(isFullSentence(candidate) && // doesNotBeginWithCaps(remainder))) { while (!(doesNotBeginWithPunctuation (remainder) && isFullSentence (candidate))) { /* Get the next tentative boundary for the sentenceString */ int nextBoundary = getNextCandidateBoundary (); if (nextBoundary == -1) { break; } boundary = nextBoundary; candidate = doc.substring (0, boundary + 1); remainder = doc.substring (boundary + 1); } if (candidate.length () > 0) { sentenceArrayList.add (new Integer (currentOffset + boundary + 1)); currentOffset += boundary + 1; } // origDoc.substring(0,currentOffset)); doc = remainder; } } while (boundary != -1); if (doc.length () > 0) { sentenceArrayList.add (new Integer (currentOffset + doc.length ())); } sentenceArrayList.trimToSize (); return sentenceArrayList; } private void setDocumentForObtainingBoundaries (String doc) { sentenceEndingMatcher = SentenceConstants. sentenceEndingPattern.matcher (doc); } private int getNextCandidateBoundary () { if (sentenceEndingMatcher.find ()) { return sentenceEndingMatcher.start (); } else return -1; } private boolean doesNotBeginWithPunctuation (String remainder) { Matcher m = SentenceConstants.punctuationPattern.matcher (remainder); return (!m.find ()); } private String getLastWord (String cand) { Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand); if (lastWordMatcher.find ()) { return lastWordMatcher.group (); } else { return ""; } } /* * Looks at the last character of the String. If this last character is * par of an abbreviation (as detected by REGEX) * then the sentenceString is not a fullSentence and "false" is returned */ private boolean isFullSentence (String cand) { // cand = cand.replaceAll("n", " "); cand = " " + cand; Matcher validSentenceBoundaryMatcher = SentenceConstants.validSentenceBoundaryPattern.matcher (cand); if (validSentenceBoundaryMatcher.find ()) return true; Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand); if (abbrevMatcher.find ()) { return false; // Means it ends with an abbreviation } else { // Check if the last word of the sentenceString has an entry in the // abbreviations dictionary (like Mr etc.) String lastword = getLastWord (cand); if (abbreviations.contains (lastword)) { return false; } } return true; } } Java Implementation of Sentence Boundary Detection Expressivity: AQL vs. Custom Java Code create dictionary AbbrevDict from file 'abbreviation.dict’; create view SentenceBoundary as select R.match as boundary from ( extract regex /(([.?!]+s)|(ns*n))/ on D.text as match from Document D ) R where Not(ContainsDict('AbbrevDict', CombineSpans(LeftContextTok(R.match, 1),R.match))); Equivalent AQL Implementation
  14. 14. Expressivity: Rich Set of Operators Deep Syntactic Parsing ML Training & Scoring Core Operators Tokenization Parts of Speech Dictionaries Semantic Role Labels Language to express NLP Algorithms à AQL ….Regular Expressions HTML/XML Detagging Span Operations Relational Operations Aggregation Operations
  15. 15. Expressivity: Multilingual Support create dictionary PurchaseVerbs as ('buy.01', 'purchase.01', 'acquire.01', 'get.01'); create view Relation as select A.verb as BUY_VERB, R2.head as PURCHASE, A.polarity as WILL_BUY from Action A, Role R where MatchesDict('PurchaseVerbs', A.verbClass); and Equals(A.aid, R.aid) and Equals(R.type, 'A1'); ACL ‘15, ‘16, EMNLP ‘16, COLING ‘16 • Parsing all languages into same predicate argument structure (language-independent) • Common predicate argument structure surfaced in AQL
  16. 16. Scalability: Class of Optimizations in SystemT • Rewrite-based: rewrite algebraic operator graph – Shared Dictionary Matching – Shared Regular Expression Evaluation – Regular Expression Strength Reduction – On-demand tokenization • Cost-based: relies on novel selectivity estimation for text-specific operators – Standard transformations • E.g., push down selections – Restricted Span Evaluation • Evaluate expensive operators on restricted regions of the document 16 Tokenization overhead is paid only once Person (followed within 30 tokens) Plan C Plan A Join Phone Restricted Span Evaluation Plan B Person Identify Phone starting within 30 tokens Extract text to the right Phone Identify Person ending within 30 tokens Extract text to the left
  17. 17. Transparency in SystemT PersonPhone • Clear and simple provenance à Person PhonePerson Anna at James St. office (555-5555) …. [Liu et al., VLDB’10] Ease of comprehension Ease of debugging Ease of enhancement • Explains output data in terms of input data, intermediate data, and the transformation • For predicate-based rule languages (e.g., SQL), can be computed automatically create view PersonPhone as select P.name as person, N.number as phone from Person P, PhoneNum N where Follows(P.name, N.number, 0, 30); person phone Anna 555-5555 James 555-5555 Person name Anna James Phone number 555-5555t1 t2 t3 t1 Ù t3 t2 Ù t3
  18. 18. Web Tools Overview Ease of Programming Ease of Sharing
  19. 19. • Use cases • Requirements • SystemT • SystemT courses Outline
  20. 20. SystemT in Universities 2013 2015 2016 2017 • UC Santa Cruz (full Graduate class) 2014 • U. Washington (Grad) • U. Oregon (Undergrad) • U. Aalborg, Denmark (Grad) • UIUC (Grad) • U. Maryland Baltimore County (Undergrad) • UC Irvine (Grad) • NYU Abu-Dhabi (Undergrad) • U. Washington (Grad) • U. Oregon (Undergrad) • U. Maryland Baltimore County (Undergrad) • … • UC Santa Cruz, 3 lectures in one Grad class SystemT MOOC
  21. 21. Alon Halevy, November 2015 "The tutorial of System T was an important hands-on component of a Ph.D course on Structured Data on the Web at the University of Aalborg in Denmark taught by Alon Halevy. The tutorial did a great job giving the students the feeling for the challenges involved in extracting structured data from text.“ Shimei Pan, University of Maryland, Baltimore County, May 2016 "The Information Extraction with SystemT class, which was offered for the first time at UMBC, has been a wonderful experience for me and my students. I really enjoyed teaching this course. Students were also very enthusiastic. Based on the feedback from my students, they have learned a lot. Some of my students even want me to offer this class every semester.“ SystemT in Universities: Testimonials
  22. 22. • Professors like teaching it • Students do interesting projects and have fun! • Identify character relationships in Games of Thrones • Determine the cause and treatment for PTSD patients • Extract job experiences and skills from resumes • … SystemT in Universities: Summary
  23. 23. SystemT MOOC Bring SystemT to 100,000+ students Target Audience: NLP Engineer, Data Scientist e.g., University students (grad, undergrad), industry professionals Learning Path with two classes TA0105: Text Analytics: Getting Results with SystemT • Basics of IE with SystemT, learning to write extractors using the web tool, learn AQL • 5 lectures + 1 hands-on lab + final exam TA0106EN: Advanced Text Analytics: Getting Results with SystemT • Advanced material on declarative approach and SystemT optimization • 2 lectures + final exam
  24. 24. Summary create view Company as select ... from ... where ... create view SentimentFeaturesas select … from …; AQLExtractorsAQLExtractors Embedded machine learning model AQL Rules create view SentimentForCompany as select T.entity, T.polarity from classifyPolarity (SentimentFeatures) T; create view Company as select ... from ... where ... create view SentimentFeaturesas select … from …; AQLExtractorsAQLExtractors Embedded machine learning model AQL Rules AQLExtractorsAQLExtractors Embedded machine learning model Embedded machine learning model AQL Rules create view SentimentForCompany as select T.entity, T.polarity from classifyPolarity (SentimentFeatures) T; Declarative AQL language Development Environment Cost-based optimization ... Discovery tools for AQL development SystemT Runtime Input Documents Extracted Objects AQL: a declarative language that can be used to build extractors outperforming the state-of-the-art [ACL’10, EMNLP’10, PODS’13,PODS’14,EMNLP Tutorial’15] Multilingual: [ACL’15,ACL’16,EMNLP’16,COLING’16] A suite of novel development tooling leveraging machine learning and HCI [EMNLP’08, VLDB’10, ACL’11, CIKM’11, ACL’12, EMNLP’12, CHI’13, SIGMOD’13, ACL’13,VLDB’15, NAACL’15] Cost-based optimization for text- centric operations [ICDE’08, ICDE’11, FPL’13, FPL’14] Highly embeddable runtime with high- throughput and small memory footprint. [SIGMOD Record’09, SIGMOD’09] A declarative information extraction system with cost-based optimization, high-performance runtime and novel development tooling based on solid theoretical foundation [PODS’13, 14]. Shipping with 10+ IBM products, and installed on 400+ client’s sites. InfoSphere Streams InfoSphere BigInsights SystemTSystemT IBM Engines …
  25. 25. Find out more about SystemT ! https://ibm.biz/BdF4GQ Try out SystemT Watch a demo Learn about using SystemT in unversity courses

×