Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL

I, Robot, Esquire: Information Extraction and Summarization in Legal Documents

Pundits constantly predict the demise of many types of knowledge workers at the hands of intelligent machines, and few professionals perform more textual document review than lawyers. In this session, I’ll share work that eBrevia has been doing to apply research from the fields of ML and NLP to summarize and extract information from legal contracts to help accelerate corporate mergers and acquisitions. I will look at the unique characteristics of the legal industry, examine some supervised and semi-supervised training strategies and classification models, and discuss the limitations of these techniques and the essential role lawyers will continue to play.

Published in: Technology

  1. I, Robot, Esquire: Information Extraction and Summarization in Legal Documents. Jacob Mundt, MLconf ATL. jmundt@ebrevia.com | (203) 870-3000. Proprietary & Confidential.
  2. Who we are: Commercializing machine learning technology developed at Columbia University to make legal document review more efficient, accurate and cost effective. One of four national winners in Startup America DEMO Competition. One of CIO.com’s top ten enterprise products at DEMO Fall 2012. Most Promising Software Product of the Year award from Connecticut Technology Council. Completed Connecticut Innovations’ TechStart Fund Program.
  3. Management Team: Ned Gannon, CEO; Adam Nguyen, COO; Jake Mundt, CTO. Large law firm experience; tech startup experience; sales & business development experience; Harvard Law. Led R&D team at tech company extracting data in medical industry; Columbia Masters; NLP researcher. Founder of Ivy Link (20+ staff); Chief of Staff of 350-person real estate private equity firm; Harvard Law; law firm & in-house experience.
  4. The Future of Law: “In contrast, in looking 25 years ahead from now, I argue that it would be absurd to expect lawyers and courts to carry on operating as they do now.” (Richard Susskind, Tomorrow’s Lawyers) “Well, if droids could think, there'd be none of us here, would there?” (Obi-Wan Kenobi)
  5. I, Robot, Esquire - Overview: Motivation; Can we use ML and NLP?; eBrevia Solution – Deep Dive; Challenges and Lessons Learned; Future directions.
  6. Corporate Mergers and Due Diligence: Business due diligence; Legal due diligence; Closing.
  7. Corporate mergers and due diligence
  8. Legal Due Diligence Process: Extract, Summarize, Analyze, Advise. Teams of junior attorneys billed out at $300-$500/hour poring over hundreds of contracts in virtual data rooms to summarize their content and identify red flags.
  9. Legal Due Diligence Summary: Here come the spreadsheets. Summarize ALL the contracts: leases; executive employment agreements; supplier agreements; loan/credit agreements. Extract key data points. Also extract any clauses that discuss particular provisions.
  10. The Stone Age: On-site data room with reams of documents, organized by seller. Buyer’s agents travel to evaluate the target, under constant supervision.
  11. State of the Art – Virtual Data Rooms: Digitized, but not machine readable. Some simple OCR and searching capability. Commercial systems like IntraLinks have advanced capabilities, but mostly focused on security and auditability.
  12. The Future is Here: Misses stems, synonyms, plural forms. False positives: some common words also have special meanings in context. Impossible to find dates, parties, dollar amounts, or any other generic quantities. We can do better.
  13. I, Robot, Esquire - Overview: Motivation; Can we use ML and NLP?; eBrevia Solution – Deep Dive; Challenges and Lessons Learned; Future directions.
  14. Can we use ML and NLP? Actually many sub-problems: classify entire document type (discover contracts amongst heterogeneous corpus); duplicate detection; group documents that were based on a common form agreement; automatically flagging questionable docs for further review; automatic provision extraction.
  15. Why this is Easy: Precise, formal writing. Extremely structured. Lots of clause reuse.
  16. Why this is Hard: Precise, formal writing. Extremely structured. Lots of clause reuse. Obfuscation. High demands on recall. Deep chains of defined term references.
  17. Detecting “Evil” Clauses? Lawyers actually prefer to make the calls on exactly what to include, and how to advise the client. Just find the source material, and let the lawyer decide. Determine relevance, don’t make value judgments. “Learning to detect spyware using end user license agreements”, Lavesson, et al. (2009). Image: illustration of Saint Wolfgang and the Devil with the Devil's Contract, by Michael Pacher.
  18. I, Robot, Esquire - Overview: Motivation; Can we use ML and NLP?; eBrevia Solution – Deep Dive; Challenges and Lessons Learned; Future directions.
  19. eBrevia’s Approach: Not all provisions are the same! Approaches: topic modeling; Information Extraction (IE); rule based approach. Examples: find sentences discussing “change of control”; find restrictions concerning confidential information; “The contract runs from TIMEX to TIMEX.”; “The monthly rent will start at $X, and increase by no more than Y% annually.”; find every borrower’s FICO score.
  20. Text analysis pipeline: OCR; Sentence Segmentation; NLP Processing (POS, NER, Parsing); Document Structure tagging; General Candidate detection; Rule Based detection; Topic classifier; Candidate detection for IE; Information Extraction and slot filling.
  21. Classifier Features: Basic textual analysis features: words; n-grams; positional and morphological features; named entities. Syntactic features: parts of speech; parse tree and heads. Structural features: first level classifier pass for determining document structure; especially important on scanned documents where these features aren’t readily available. Slide illustrations: named entity tags (The/O buyer/O Acme/ORG Inc./ORG); parts of speech (Client shall indemnify: N V V); document structure (Section III: Miscellaneous; 1. Lorem ipsum dolor; a. sit amet, consectetur).
  22. Hunting for Training Data: All your customers’ data is confidential: redacted contracts; mine the SEC. Expense of lawyer-labeled training data: bootstrapping; co-training with different feature sets; active learning.
  23. Hacks and Special Cases: Very useful, but boring. Formatting fixes specific to legal documents: ALL CAPS; handling of amendments; handwritten signature blocks. Hand-crafted rules are very good for high-precision heuristics: customers expect the software not to miss “easy” provisions.
  24. I, Robot, Esquire - Overview: Motivation; Can we use ML and NLP?; eBrevia Solution – Deep Dive; Challenges and Lessons Learned; Future directions.
  25. The Audacity of Keywords: Seemingly reliable keywords aren’t.
      Phrase                      Likelihood relevant   Likelihood irrelevant
      “Change [of|in] Control”    48.4%                 51.6%
      “13(d) and 14(d)”           98.7%                 1.3%
      (Columns give the likelihood that a candidate phrase is relevant or irrelevant.)
      A simple keyword-based search with an obvious keyword wouldn’t even get us to 50% precision! Conversely, a human would never have discovered this reliable trigram heuristic.
  26. The Tyranny of Paper: Lawyers still have a lot of paper: over 50% of the documents uploaded to our system are scans. OCR on poor quality scans works poorly for keyword searching but decently with ML, with properly constructed features.
  27. Welcoming our Robot Lawyer Overlords: “[eBrevia’s software] cuts down significantly on time by performing 50-60% of the work up front and then you work from there.” (NY law firm partner) “Your product is a great fit for our firm’s approach to practicing law.” (Partner, national law firm)
  28. User Interface Notes
  29. User Interface Notes: Highlight in original, formatted document. Cross-referencing, editing, and corrections.
  30. User Interface Notes: Additional critical features: quick correction; level of confidence indications (similar to Google voice transcription); good generic text search features to make human review easy.
  31. I, Robot, Esquire - Overview: Motivation; Can we use ML and NLP?; eBrevia Solution – Deep Dive; Challenges and Lessons Learned; Future directions.
  32. Current Research and Future Directions: Coreference resolution, intra- and inter-document; useful for doc references and entity references. Machine learning for document cross-referencing and definition resolution. Automatic summarization of longer provisions to provide quick overviews. Understanding the lineage of a document: where its various pieces came from, and how they were changed.
  33. Feedback Learning from Lawyers: Some lawyers are just bad. Noise is NOT random: they fall for the same “trap”; they’re often bad in the same way. So we can’t use noise-tolerant learning algorithms to deal with this. Consensus models; model user reputation/ability.
  34. Current Research and Future Directions: Other upcoming applications for eBrevia’s technology: contract management; document drafting; lease abstraction; financial/compliance; consumer applications.
  35. Thank You – Contact Info: jmundt@ebrevia.com | (203) 870-3000
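
Slide 21 lists words, n-grams, and positional and morphological features as the basic inputs to the classifier. The sketch below shows one way such features could be computed for a single sentence in plain Python. It is an illustration only, not eBrevia's implementation: the function names, the four-bucket position scheme, the three-letter suffix feature, and the all-caps flag are assumptions chosen for the example.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Split a sentence into tokens, keeping parentheses so "13(d)" stays whole."""
    return re.findall(r"[A-Za-z0-9()]+", sentence)


def sentence_features(sentence, index, total):
    """Return a sparse feature Counter for the sentence at `index` of `total`.

    All feature names and bucket sizes here are hypothetical choices.
    """
    tokens = tokenize(sentence)
    lowered = [t.lower() for t in tokens]
    feats = Counter()

    # Word (unigram) features.
    for word in lowered:
        feats["word=" + word] += 1

    # Bigram features built from adjacent tokens.
    for first, second in zip(lowered, lowered[1:]):
        feats["bigram=" + first + "_" + second] += 1

    # Positional feature: which quarter of the document the sentence falls in.
    feats["position_bucket=%d" % (4 * index // total)] = 1

    # Formatting feature: legal documents often put whole clauses in ALL CAPS.
    letters = [c for c in sentence if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        feats["all_caps"] = 1

    # Crude morphological feature: three-letter suffix of longer words.
    for word in lowered:
        if len(word) > 4:
            feats["suffix3=" + word[-3:]] += 1

    return feats


example = sentence_features("Client shall indemnify Acme Inc.", 2, 10)
shouted = sentence_features("THIS AGREEMENT IS BINDING", 9, 10)
```

The resulting counters could be passed to any classifier that accepts sparse feature dictionaries. Named-entity, part-of-speech, and parse features from the slide would need an NLP toolkit and are left out here.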

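
Slides 12 and 25 argue that an obvious keyword is a weak detector: it fires on irrelevant sentences and misses paraphrases. The sketch below measures that for a “Change [of|in] Control” pattern, assuming a small labeled sample is available. The sentences and labels in `SAMPLE` are invented for illustration and are not eBrevia data; the 48.4% figure on slide 25 comes from the talk, not from this code.

```python
import re

# Regex form of the slide's “Change [of|in] Control” keyword.
KEYWORD = re.compile(r"\bchange\s+(?:of|in)\s+control\b", re.IGNORECASE)

# Hypothetical (sentence, is_relevant) pairs; the labels are assumptions.
SAMPLE = [
    ("This Agreement terminates upon a Change of Control of the Company.", True),
    ("“Change in Control” has the meaning set forth in Section 1.1.", False),
    ("See the definition of Change of Control in Exhibit A.", False),
    ("Upon any change in control, all unvested options vest immediately.", True),
    ("Tenant shall not make changes to the climate control system.", False),
    ("A sale of a controlling interest in Borrower requires Lender consent.", True),
]


def precision_recall(labeled_sentences, pattern):
    """Return (precision, recall) of a keyword pattern against labeled sentences."""
    true_pos = false_pos = false_neg = 0
    for text, is_relevant in labeled_sentences:
        matched = pattern.search(text) is not None
        if matched and is_relevant:
            true_pos += 1
        elif matched:
            false_pos += 1
        elif is_relevant:
            false_neg += 1
    # Guard against empty denominators when nothing matches or nothing is relevant.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else None
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else None
    return precision, recall


precision, recall = precision_recall(SAMPLE, KEYWORD)
```

On this toy sample the keyword matches four sentences, two of them relevant, and misses the relevant sentence that paraphrases the concept, which mirrors both failure modes the slides describe.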