
2017-01-25-SystemT-Overview-Stanford


  3. [Pie chart: share by bank — Bank 1 15%, Bank 2 15%, Bank 3 5%, Bank 4 21%, Bank 5 20%, Bank 6 23%]
  4. Operations Analysis
  7. Example inputs. Sentence: "We are raising our tablet forecast." with its dependency tree (subj: We → raising; pred: raising; obj: forecast; det: our → forecast). Syslog line: "Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected" parsed into fields — Time: Oct 1 04:12:24; Host: 9.1.1.3; Process: 41865; Category: %PLATFORM_ENV-1-DUAL_PWR; Message: Faulty internal power supply B detected.
  8. Singapore 2012 Annual Report (136-page PDF): identify the line item for Operating expenses in the income statement (a financial table in the PDF), then identify the note breaking down the Operating expenses line item and extract its opex components.
  10. Examples — "Intel's 2013 capex is elevated at 23% of sales, above average of 16%"; "FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury"; "I'm still hearing from clients that Merrill's website is better." Questions for each: customer or competitor? Good or bad? Which entity of interest?
  16. Example document (lorem-ipsum filler) containing "Tomorrow, we will meet Mark Scott, Howard Smith and …". Grammar-based extraction: Tokenization (preprocessing step), then Level 1 rules: Gazetteer[type = LastGaz] → Last; Gazetteer[type = FirstGaz] → First; Token[~ "[A-Z]\w+"] → Caps. Rule priority is used to prefer First over Caps. Lossy Sequencing: annotations are dropped because the input to the next stage must be a sequence — First is preferred over Last since it was declared earlier.
  17. Same example document. Level 1: Gazetteer[type = LastGaz] → Last; Gazetteer[type = FirstGaz] → First; Token[~ "[A-Z]\w+"] → Caps. Level 2: First Last → Person; First Caps → Person; First → Person. Rigid rule priority and Lossy Sequencing in Level 1 caused partial results.
  20. AQL Language → Optimizer → Operator Graph: specify extractor semantics declaratively (express the logic of the computation, not the control flow); the optimizer chooses an efficient execution plan that implements those semantics; the optimized plan is executed at runtime.
  22. Data model — Document(text: String); Person(first: Span, last: Span, fullname: Span).
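The Span-based data model on this slide can be sketched in Python — an illustrative stand-in, not SystemT's actual API. A Span is a half-open character interval over the document text; tuples (here, Person) hold Span-valued fields:

```python
from dataclasses import dataclass

# Illustrative sketch of the AQL data model: tuples whose fields are
# scalars or Spans, where a Span is a [begin, end) character interval
# over the document text. Not SystemT's actual classes.

@dataclass(frozen=True)
class Span:
    begin: int  # inclusive character offset
    end: int    # exclusive character offset

    def text(self, doc: str) -> str:
        return doc[self.begin:self.end]

@dataclass(frozen=True)
class Person:
    first: Span
    last: Span
    fullname: Span

doc = "we will meet Mark Scott and Anna"
p = Person(first=Span(13, 17), last=Span(18, 23), fullname=Span(13, 23))
```

Because spans only reference offsets, many annotations can cover the same text without copying it.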
  24. Operators consume and produce tuples of spans: a Dictionary operator applied to the input Document tuple ("… we will meet Mark Scott and …") produces output tuples whose Span fields (e.g. "Mark", "Anna") point back into the document text.
  26. Operator graph over "…Tomorrow, we will meet Mark Scott, Howard Smith …": Dictionary&lt;First&gt;, Dictionary&lt;Last&gt;, and Regex&lt;Caps&gt; feed Join&lt;First, Caps&gt; and Join&lt;First, Last&gt;; their results ("Mark Scott", "Howard Smith") are combined by Union, and an explicit Consolidate operator resolves ambiguity. Input may contain overlapping annotations (no Lossy Sequencing problem); output may contain overlapping annotations (no Rigid Matching Regimes).
  27. Joining First with Caps zero tokens apart (&lt;First&gt; &lt;Caps&gt;):

```sql
create view FirstCaps as
select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0);
```
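A minimal Python sketch of the two built-ins this view relies on — FollowsTok(a, b, 0, 0) (b starts 0–0 tokens after a ends) and CombineSpans(a, b) (the shortest span covering both). Spans are (begin, end) character offsets; this is hypothetical helper code, not SystemT's implementation:

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]

def tokenize(doc: str) -> List[Span]:
    # Simplistic word tokenizer; SystemT's real tokenizer is richer.
    return [(m.start(), m.end()) for m in re.finditer(r"\w+", doc)]

def follows_tok(doc: str, a: Span, b: Span, min_tok: int, max_tok: int) -> bool:
    # Count whole tokens strictly between the end of a and the start of b.
    between = [t for t in tokenize(doc) if t[0] >= a[1] and t[1] <= b[0]]
    return b[0] >= a[1] and min_tok <= len(between) <= max_tok

def combine_spans(a: Span, b: Span) -> Span:
    return (min(a[0], b[0]), max(a[1], b[1]))

doc = "we will meet Mark Scott"
first = (13, 17)   # "Mark"
caps = (18, 23)    # "Scott"
first_caps = combine_spans(first, caps) if follows_tok(doc, first, caps, 0, 0) else None
```

The view above is then just this predicate-and-combine applied to every (First, Caps) pair.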
  28. Combining the three candidate patterns &lt;First&gt;&lt;Caps&gt;, &lt;First&gt;&lt;Last&gt;, and &lt;First&gt;:

```sql
create view Person as
select S.name as name
from (
  (select CombineSpans(F.name, C.name) as name
   from First F, Caps C
   where FollowsTok(F.name, C.name, 0, 0))
  union all
  (select CombineSpans(F.name, L.name) as name
   from First F, Last L
   where FollowsTok(F.name, L.name, 0, 0))
  union all
  (select * from First F)
) S
consolidate on name;
```
  29. The same Person view, annotated: the explicit consolidate clause resolves ambiguity (no Rigid Priority problem), and the input may contain overlapping annotations (no Lossy Sequencing problem).
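The consolidate clause can be sketched with a containment-based policy in the spirit of AQL's ContainedWithin consolidation: among overlapping candidates, keep only spans not contained inside a longer match. Illustrative code, not SystemT's:

```python
from typing import List, Tuple

Span = Tuple[int, int]

def consolidate(spans: List[Span]) -> List[Span]:
    """Drop every span that lies inside another candidate span."""
    def contained_in_other(s: Span) -> bool:
        return any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)
    return sorted(s for s in set(spans) if not contained_in_other(s))

# Candidates over "we will meet Mark Scott": <First> alone and <First><Last>.
candidates = [(13, 17), (13, 23)]   # "Mark", "Mark Scott"
```

Here "Mark" is discarded because it is contained in "Mark Scott" — the longer interpretation wins without any rigid rule priority.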
  30. AQL: a language to express NLP algorithms. Core operators: Tokenization, Parts of Speech, Dictionaries, Regular Expressions, Span Operations, Relational Operations, Aggregation Operations. Built on top: Deep Syntactic Parsing, ML Training &amp; Scoring, Semantic Role Labels.
  31. Java implementation of sentence boundary detection vs. the equivalent AQL implementation:

```java
package com.ibm.avatar.algebra.util.sentence;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.regex.Matcher;

public class SentenceChunker {
  private Matcher sentenceEndingMatcher = null;
  private HashSet<String> abbreviations = new HashSet<String>();

  public SentenceChunker() {}

  /** Constructor that takes in the abbreviations directly. */
  public SentenceChunker(String[] abbreviations) {
    for (String abbr : abbreviations) {
      this.abbreviations.add(abbr);
    }
  }

  /**
   * @param doc the document text to be analyzed
   * @return true if the document contains at least one sentence boundary
   */
  public boolean containsSentenceBoundary(String doc) {
    String origDoc = doc;
    int boundary;
    int currentOffset = 0;
    do {
      // Get the next tentative boundary for the sentence string.
      setDocumentForObtainingBoundaries(doc);
      boundary = getNextCandidateBoundary();
      if (boundary != -1) {
        String candidate = doc.substring(0, boundary + 1);
        String remainder = doc.substring(boundary + 1);
        // If the last character is part of an abbreviation (as detected
        // by regex), the candidate is not a full sentence; keep scanning.
        while (!(doesNotBeginWithPunctuation(remainder)
            && isFullSentence(candidate))) {
          int nextBoundary = getNextCandidateBoundary();
          if (nextBoundary == -1) break;
          boundary = nextBoundary;
          candidate = doc.substring(0, boundary + 1);
          remainder = doc.substring(boundary + 1);
        }
        if (candidate.length() > 0) {
          // Found a sentence boundary. If the boundary is the last
          // character in the string, we don't consider it to be
          // contained within the string.
          int baseOffset = currentOffset + boundary + 1;
          return baseOffset < origDoc.length();
        }
        doc = remainder;
      }
    } while (boundary != -1);
    // If we get here, we didn't find any boundaries.
    return false;
  }

  public ArrayList<Integer> getSentenceOffsetArrayList(String doc) {
    ArrayList<Integer> sentenceArrayList = new ArrayList<Integer>();
    int boundary;
    int currentOffset = 0;
    sentenceArrayList.add(new Integer(0));
    do {
      setDocumentForObtainingBoundaries(doc);
      boundary = getNextCandidateBoundary();
      if (boundary != -1) {
        String candidate = doc.substring(0, boundary + 1);
        String remainder = doc.substring(boundary + 1);
        while (!(doesNotBeginWithPunctuation(remainder)
            && isFullSentence(candidate))) {
          int nextBoundary = getNextCandidateBoundary();
          if (nextBoundary == -1) break;
          boundary = nextBoundary;
          candidate = doc.substring(0, boundary + 1);
          remainder = doc.substring(boundary + 1);
        }
        if (candidate.length() > 0) {
          sentenceArrayList.add(new Integer(currentOffset + boundary + 1));
          currentOffset += boundary + 1;
        }
        doc = remainder;
      }
    } while (boundary != -1);
    if (doc.length() > 0) {
      sentenceArrayList.add(new Integer(currentOffset + doc.length()));
    }
    sentenceArrayList.trimToSize();
    return sentenceArrayList;
  }

  private void setDocumentForObtainingBoundaries(String doc) {
    sentenceEndingMatcher =
        SentenceConstants.sentenceEndingPattern.matcher(doc);
  }

  private int getNextCandidateBoundary() {
    if (sentenceEndingMatcher.find()) {
      return sentenceEndingMatcher.start();
    }
    return -1;
  }

  private boolean doesNotBeginWithPunctuation(String remainder) {
    Matcher m = SentenceConstants.punctuationPattern.matcher(remainder);
    return (!m.find());
  }

  private String getLastWord(String cand) {
    Matcher lastWordMatcher =
        SentenceConstants.lastWordPattern.matcher(cand);
    if (lastWordMatcher.find()) {
      return lastWordMatcher.group();
    }
    return "";
  }

  /*
   * Looks at the last character of the String. If it is part of an
   * abbreviation (as detected by regex), the candidate is not a full
   * sentence and false is returned.
   */
  private boolean isFullSentence(String cand) {
    cand = " " + cand;
    Matcher validSentenceBoundaryMatcher =
        SentenceConstants.validSentenceBoundaryPattern.matcher(cand);
    if (validSentenceBoundaryMatcher.find()) return true;
    Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher(cand);
    if (abbrevMatcher.find()) {
      return false; // Ends with an abbreviation.
    }
    // Check if the last word of the candidate has an entry in the
    // abbreviations dictionary (like "Mr" etc.).
    String lastword = getLastWord(cand);
    if (abbreviations.contains(lastword)) {
      return false;
    }
    return true;
  }
}
```

Equivalent AQL implementation:

```sql
create dictionary AbbrevDict from file 'abbreviation.dict';

create view SentenceBoundary as
select R.match as boundary
from (
  extract regex /(([.?!]+\s)|(\n\s*\n))/ on D.text as match
  from Document D
) R
where Not(ContainsDict('AbbrevDict',
      CombineSpans(LeftContextTok(R.match, 1), R.match)));
```
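The AQL version's logic — candidate boundaries from a punctuation regex, filtered by an abbreviation dictionary on the token to the left — can be mimicked in a few lines of Python. This is an illustrative sketch; the abbreviation list here is made up:

```python
import re

# Assumed example dictionary; AbbrevDict on the slide is loaded from a file.
ABBREVIATIONS = {"Mr.", "Dr.", "etc."}

def sentence_boundaries(doc: str):
    """Offsets of sentence boundaries: runs of .?! followed by whitespace
    (or blank lines), except when preceded by a known abbreviation."""
    out = []
    for m in re.finditer(r"([.?!]+\s)|(\n\s*\n)", doc):
        left = doc[:m.end()].split()
        last_token = left[-1] if left else ""
        if last_token not in ABBREVIATIONS:
            out.append(m.start())
    return out

doc = "Dr. Smith arrived. He sat down."
```

The period after "Dr." is rejected because the left-context token is in the dictionary; only the period after "arrived" survives.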
  35. Three equivalent plans for "First followed within 0 tokens by Caps" — Plan A: identify First and Caps independently, then Join. Plan B: identify First, extract the text to its right, and identify Caps starting within 0 tokens (Restricted Span Evaluation). Plan C: identify Caps, extract the text to its left, and identify First ending within 0 tokens. Tokenization overhead is paid only once.
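Restricted Span Evaluation (Plans B/C) can be sketched as: instead of matching the Caps regex over the whole document and joining, evaluate it only on a short window to the right of each First match. Illustrative code under assumed names, not the SystemT optimizer:

```python
import re

# Assumed toy first-name dictionary for the example.
FIRST_NAMES = {"mark", "howard"}

def first_caps_restricted(doc: str, window: int = 20):
    """Find <First><Caps> pairs by evaluating the Caps regex only in a
    small region after each First match (Plan B above)."""
    results = []
    for m in re.finditer(r"\w+", doc):
        if m.group().lower() not in FIRST_NAMES:
            continue
        # Only inspect a short window to the right of the First match.
        region = doc[m.end():m.end() + window]
        c = re.match(r"\s([A-Z]\w+)", region)  # Caps token 0 tokens after First
        if c:
            results.append(doc[m.start():m.end() + c.end(1)])
    return results

doc = "Tomorrow, we will meet Mark Scott, Howard Smith and others"
```

The asymptotic win is that the expensive pattern runs over O(matches × window) characters rather than the whole document.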
  36. [Chart: throughput (KB/sec) vs. average document size (KB).] SystemT is roughly 10–50x faster than an open-source entity tagger [Chiticariu et al., ACL'10].
  37. [Chiticariu et al., ACL'10]

| Dataset | Size range | Avg. size | ANNIE throughput (KB/sec) | SystemT throughput (KB/sec) | ANNIE avg. memory (MB) | SystemT avg. memory (MB) |
|---|---|---|---|---|---|---|
| Web Crawl | 68 B – 388 KB | 8.8 KB | 42.8 | 498.8 | 201.8 | 77.2 |
| Medium SEC Filings | 240 KB – 0.9 MB | 401 KB | 26.3 | 703.5 | 601.8 | 143.7 |
| Large SEC Filings | 1 MB – 3.4 MB | 1.54 MB | 21.1 | 954.5 | 2683.5 | 189.6 |
  40. Provenance example over "Anna at James St. office (555-5555) …":

```sql
create view PersonPhone as
select P.name as person, N.number as phone
from Person P, Phone N
where Follows(P.name, N.number, 0, 30);
```

Person tuples t1 ("Anna") and t2 ("James") joined with Phone tuple t3 ("555-5555") yield PersonPhone outputs whose provenance is recorded as a Boolean expression over the inputs (t1 ∧ t3, t2 ∧ t3).
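Tuple-level provenance, as on this slide, can be sketched by having each join output carry a Boolean expression over its input tuples, so an unwanted result can be traced back to the inputs that produced it. Names here are illustrative:

```python
# Input relations: (tuple id, value) pairs mirroring the slide's t1, t2, t3.
person = [("t1", "Anna"), ("t2", "James")]
phone = [("t3", "555-5555")]

def join_with_provenance(persons, phones):
    """Cross-join persons with phones; each output records why it exists
    as a conjunction of its input tuple ids."""
    out = []
    for pid, name in persons:
        for hid, number in phones:
            out.append({"person": name, "phone": number,
                        "provenance": f"{pid} AND {hid}"})
    return out

results = join_with_provenance(person, phone)
```

To suppress a wrong output such as (James, 555-5555), its provenance "t2 AND t3" says exactly which input tuples or which operator to change.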
  42. Teaching adoption, 2013–2017: UC Santa Cruz (full Graduate class); U. Washington (Grad); U. Oregon (Undergrad); U. Aalborg, Denmark (Grad); UIUC (Grad); U. Maryland Baltimore County (Undergrad); UC Irvine (Grad); NYU Abu-Dhabi (Undergrad); U. Washington (Grad); U. Oregon (Undergrad); U. Maryland Baltimore County (Undergrad); UC Santa Cruz (3 lectures in one Grad class); …; SystemT MOOC.
  46. Extracting purchase relations from semantic role labels [ACL '15, '16, EMNLP '16, COLING '16a, '16b, '16c]:

```sql
create dictionary PurchaseVerbs as
  ('buy.01', 'purchase.01', 'acquire.01', 'get.01');

create view Relation as
select A.verb as BUY_VERB, R.head as PURCHASE, A.polarity as WILL_BUY
from Action A, Role R
where MatchesDict('PurchaseVerbs', A.verbClass)
  and Equals(A.aid, R.aid)
  and Equals(R.type, 'A1');
```
  53. Ease of Programming. Ease of Sharing.
  54. Example rules over "Anna at James St. office (555-5555), or James, her assistant - 777-7777 have the details.":

```sql
R1: create view Phone as
    Regex('\d{3}-\d{4}', Document, text);

R2: create view Person as
    Dictionary('first_names.dict', Document, text);
    -- Dictionary file first_names.dict: anna, james, john, peter, ...

R3: create table PersonPhone(match span);
    insert into PersonPhone
    select Merge(F.match, P.match) as match
    from Person F, Phone P
    where Follows(F.match, P.match, 0, 60);
```
  55. Operator graph for the rules above, over "Anna at James St. office (555-5555), …": Dictionary(FirstNames.dict) produces Person matches ("James"); Regex(/\d{3}-\d{4}/) produces Phone matches ("555-5555"); the Join on Follows(name, phone, 0, 60) produces PersonPhone ("James … 555-5555").
  56. Goal: remove "James … 555-5555" from the output. Provenance graph: R1 — Regex '\d{3}-\d{4}' on text; R2 — Dictionary 'firstName.dict' on text; R3 — join on Follows(F.match, P.match, 0, 60), then Merge(F.match, P.match) as match. High-level changes (HLCs): HLC 1 — remove "555-5555" from the output of R1's Regex operator; HLC 2 — remove "James" from the output of R2's Dictionary operator; HLC 3 — remove "James … 555-5555" from the output of R3's join operator.
  57. The same HLCs refined into low-level changes (LLCs): LLC 1 — remove 'James' from FirstNames.dict; LLC 2 — add a filter predicate on street suffix in the right context of the match; LLC 3 — reduce the character gap between F.match and P.match from 60 to 10.
  58. LLCs that refine a join or dictionary operator can use span predicates such as ContainsDict(), Contains, IsContained, and Overlaps on the PersonPhone candidates.
  59. Generating and ranking LLCs
  • Input: set of HLCs, provenance graph, labeled results
  • Output: list of LLCs, ranked based on improvement in F1-measure
  • Algorithm:
    – For each operator Op, consider all HLCs (ti, Op)
    – For each HLC, enumerate all possible LLCs
    – For each LLC: compute the set of local tuples it removes from the output of Op, then propagate these removals up through the provenance graph to compute the effect on the end-to-end result
    – Rank the LLCs
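The ranking step can be sketched as: score each candidate LLC by the F1-measure of the end-to-end result after applying it, then return candidates best-first. The LLC names and their simulated outputs below are stand-ins for the slide's example, not the actual SystemT tooling:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1-measure from true/false positive and false negative counts."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def rank_llcs(llcs, labeled, apply_llc):
    """Rank LLCs by the F1 of the end-to-end result each one yields."""
    scored = []
    for llc in llcs:
        predicted = apply_llc(llc)
        score = f1(len(predicted & labeled),
                   len(predicted - labeled),
                   len(labeled - predicted))
        scored.append((score, llc))
    scored.sort(key=lambda s: -s[0])  # best F1 first (stable sort)
    return [llc for _, llc in scored]

labeled = {"Anna", "James"}  # ground-truth Person results
# Simulated end-to-end output after each candidate change:
candidate_outputs = {
    "LLC 1: remove 'James' from FirstNames.dict": {"Anna"},
    "LLC 3: reduce char gap from 60 to 10": {"Anna", "James"},
    "baseline: no change": {"Anna", "James", "James...555-5555"},
}
ranking = rank_llcs(list(candidate_outputs), labeled, candidate_outputs.get)
```

LLC 3 ranks first here: it removes the false positive without sacrificing a correct result, while LLC 1 trades recall for precision.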
  60. [Figure: output tuples of operator Op, with the subset of tuples to remove marked.] Enumerating candidate changes naively over the n output tuples ti of Op is O(n²); the algorithm considers the top-k LLCs per operator.
  61. [Charts: precision and recall across refinement iterations Baseline, I1–I5, on Enron, ACE, CoNLL, and EnronPP.] Precision — % of identified results that are correct; Recall — % of correct labels that are identified. Person extraction on formal text (CoNLL, ACE); Person and PersonPhone extraction on informal text (Enron). Precision improves greatly after a few iterations, while recall remains fairly stable.
  62. [Chart: F1-measure for Person extraction on informal text (Enron).] Almost all of the expert's refinements are among the top 12 generated refinements — done in 2 minutes, vs. Expert A's 9 refinements after 1 hour.
  64. Summary. Challenge: building extractors for enterprise applications requires an information extraction system that is expressive, efficient, transparent, and usable. Existing solutions are either rule-based (cascading grammars with expressivity and efficiency issues) or black-box machine-learning solutions that lack transparency. Our solution: a declarative information extraction system with cost-based optimization, a high-performance runtime, and novel development tooling, built on a solid theoretical foundation [PODS'13, PODS'14] and shipping in 10+ IBM products. AQL: a declarative language for building extractors that outperform the state of the art [ACL'10]; multilingual and SRL-enabled [ACL'15, ACL'16, EMNLP'16, COLING'16]. A suite of novel development tools leveraging machine learning and HCI [EMNLP'08, VLDB'10, ACL'11, CIKM'11, ACL'12, EMNLP'12, CHI'13, SIGMOD'13, ACL'13, VLDB'15, NAACL'15]. Cost-based optimization for text-centric operations [ICDE'08, ICDE'11, FPL'13, FPL'14]. Highly embeddable runtime with high throughput and a small memory footprint [SIGMOD Record'09, SIGMOD'09]. Development environment: AQL extractors (create view ProductMention as select … / create view IntentToBuy as select …), cost-based optimization, and discovery tools for AQL development; the SystemT Runtime maps input documents to extracted objects. For details and the online class, visit https://ibm.biz/BdF4GQ