On Semi-Supervised Learning
Of Legal Semantics
L. Thorne McCarty
Rutgers University
Three Papers
● 1998: Structured Casenotes: How Publishers Can Add Value to
Public Domain Legal Materials on the World Wide Web.
● 2007: Deep Semantic Interpretations of Legal Texts.
● 2015: How to Ground a Language for Legal Discourse in a
Prototypical Perceptual Semantics.
And a Proposal:
A research strategy to produce a computational summary
of a legal case, which can be scaled up to a realistic legal
corpus.
The Challenge
A structured casenote is a computational summary of the
procedural history of a case along with the substantive legal
conclusions articulated at each stage of the process. It would play
the same role in the legal information systems of the 21st century
that West Headnotes and Key Numbers have played in the 20th
century.
From my 1998 paper:
Why focus on procedural history?
Think about the traditional “brief” that students are
taught to write in their first year of law school:
The traditional case brief focuses on the procedural context first:
Who is suing whom, and for what? What is the plaintiff's legal
theory? What facts does the plaintiff allege to support this theory?
How does the defendant respond? How does the trial court
dispose of the case? What is the basis of the appeal? What
issues of law are presented to the appellate court? How does the
appellate court resolve these issues, and with what justification?
Within this procedural framework, we would represent
the substantive issues at stake in the decision.
● For the computational summary, we need an expressive
Knowledge Representation (KR) language.
● How can we build a database of structured casenotes at the
appropriate scale?
● Fully automated processing of legal texts?
● Semi-automated, with a human editor in the loop?
● For either approach, we need a Natural Language (NL)
technology that can handle the complexity of legal cases.
● But in 1998, neither the NL nor the KR technology was
sufficiently advanced.
Two Steps Toward a Solution:
ICAIL '07
Contributions:
● Showed that a “state-of-the-art statistical parser ... can handle
even the complex syntactic constructions of an appellate court
judge.”
● Showed that the “semantic interpretation of the full text of a
judicial opinion can be computed automatically from the output
of the parser.”
Technical specifications:
● Quasi-Logical Form (QLF).
● Definite Clause Grammar (DCG).
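A minimal sketch of the DCG-to-QLF idea (hypothetical rules and
lexicon, not the grammar from the paper), runnable in standard Prolog:

    % Toy DCG that threads a quasi-logical form through the parse,
    % using the term(lex, var, list) pattern shown below.
    s(sterm(Verb, _E, [Subj])) --> np(Subj), vp(Verb).
    np(nterm(Noun, _X, []))    --> [the], [Noun], { noun(Noun) }.
    vp(Verb)                   --> [Verb], { verb(Verb) }.
    noun(court).   noun(petitioner).
    verb(ruled).   verb(contends).

    % ?- phrase(s(QLF), [the, court, ruled]).
    % QLF = sterm(ruled, _E, [nterm(court, _X, [])])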
She has also brought this ADA suit in which
she claims that her former employer, Policy
Management Systems Corporation,
discriminated against her on account of her
disability.
526 U.S. 795 (1999)
Terms:
term(lex, var, list)
...
“She has also brought this ADA suit ...”
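A hypothetical QLF for this fragment, in the same style (an
illustrative guess at the structure, not the system's actual output):

    % Illustrative guess: only the term(lex, var, list) pattern is
    % taken from the slides; the nesting here is hypothetical.
    example_qlf(
        sterm(brought, _E,
            [nterm(she, _X, []),
             nterm(suit, _S, [nterm('ADA', _A, [])])])).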
The petitioner contends that the regulatory
takings claim should not have been decided by
the jury and that the Court of Appeals adopted an
erroneous standard for regulatory takings liability.
526 U.S. 687 (1999)
sterm(decided,C,[_,_]) / [modal(should),negative,perfect,passive]
...
AND
sterm(adopted,J,[_,_])
...
The court ruled that sufficient evidence had
been presented to the jury from which it
reasonably could have decided each of these
questions in Del Monte Dunes' favor.
526 U.S. 687 (1999)
Semantics of 'WDT' and 'WHNP': W^nterm(which,W,[])
Semantics of 'IN': Obj^Subj^P^pterm(in,P,[Subj,Obj])
Unify: Obj = nterm(which,W,[])
       Term = pterm(in,P,[Subj,Obj])
Semantics of 'WHPP': W^Subj^P^pterm(in,P,[Subj,nterm(which,W,[])])
Semantics of 'S': E^sterm(claims,E,[_,_])
Unify: Term = pterm(in,P,[E,nterm(which,W,[])])
       Tense = [present]
Semantics of 'SBAR':
W^(E^(P^pterm(in,P,[E,nterm(which,W,[])]) & sterm(claims,E,[_,_]))/[present])
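In Prolog, each “Unify” step above is ordinary unification of a
^-abstraction with its argument; a minimal sketch:

    % Beta-reduction by unification: applying a ^-abstraction to an
    % argument just unifies the bound variable with that argument.
    beta(Arg^Body, Arg, Body).

    % ?- beta(Obj^Subj^P^pterm(in,P,[Subj,Obj]), nterm(which,W,[]), T).
    % T = Subj^P^pterm(in,P,[Subj,nterm(which,W,[])])
    % i.e., the first Unify step above, yielding the 'WHPP' body.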
● How accurate are these semantic interpretations?
● Unfortunately, we do not have the data to answer this
question.
● Consider a different strategy:
● Write hand-coded extraction patterns to map information
from the QLF interpretations into the format of a structured
casenote.
● Generalize these extraction patterns by the unsupervised
learning of the legal semantics implicit in a large set of
unannotated legal cases.
● The total system would thus be engaged in a form of
semi-supervised learning of legal semantics.
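As a sketch of what such an extraction pattern might look like
(predicate and field names here are hypothetical):

    % Hypothetical pattern: map a QLF of the form "Party contends
    % that Claim" into a structured-casenote field.
    extract(contention(Party, Claim),
            sterm(contends, _E, [nterm(Party, _X, []), Claim])).

    % ?- extract(Field, sterm(contends, e1,
    %        [nterm(petitioner, x1, []), sterm(adopted, j1, [_,_])])).
    % Field = contention(petitioner, sterm(adopted, j1, [_,_]))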
Two Steps Toward a Solution:
ICAIL '15
● New Article (less technical, more intuitive):
“How to Ground a Language for Legal Discourse in a
Prototypical Perceptual Semantics”
(An edited transcript of a presentation at the Legal Quanta
Symposium at Michigan State University College of Law on
October 29, 2015)
Forthcoming in 2016 Michigan State Law Review _____.
Includes links to my more technical papers.
● Prototype Coding:
● The basic idea is to represent a point in an n-dimensional
space by measuring its distance from a prototype in several
specified directions (see the sketch after this list).
● Furthermore, assuming that our initial space is Euclidean,
we want to select a prototype that lies at the origin of an
embedded, low-dimensional, nonlinear subspace, which is in
some sense “optimal”.
● The second point leads to a theory of
● Manifold Learning
● Deep Learning
● The theory has three components, drawn from:
Probability, Geometry, Logic.
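A minimal sketch of prototype coding in formulas (notation assumed
for illustration; only the idea of a radial distance plus specified
directions comes from the slide):

\[
  x \;=\; p \,+\, \rho\, u(\theta_1, \dots, \theta_{n-1}),
  \qquad \rho = d(x, p), \quad \|u\| = 1,
\]

where p is the prototype, ρ is the distance from p, and the unit
vector u encodes the specified directions.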
● The Probabilistic Model:
This is a diffusion process determined by a potential function,
U(x), and its gradient, ∇U(x), in an arbitrary n-dimensional
Euclidean space.
The invariant probability measure for the diffusion process is
proportional to e^(2U(x)), which means that ∇U(x) is proportional to
the gradient of the log of the stationary probability density.
● The Geometric Model:
This is a Riemannian manifold with a Riemannian metric, g_ij(x),
which we interpret as a measure of dissimilarity.
Using this dissimilarity metric, we can define a radial coordinate,
ρ, and the directional coordinates, θ1, θ2, ..., θn−1, in our original
n-dimensional space, and then compute an optimal nonlinear
k-dimensional subspace.
The radial coordinate is defined to follow the gradient vector,
∇U(x), and the directional coordinates are defined to be
orthogonal to ∇U(x).
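In formulas (a sketch, with notation assumed): the metric supplies
the dissimilarity line element, the radial coordinate curve follows
the gradient, and the directional coordinate curves are orthogonal
to it:

\[
  ds^2 = \sum_{i,j} g_{ij}(x)\, dx^i\, dx^j,
  \qquad
  \frac{dx}{d\rho} \;\parallel\; \nabla U(x),
  \qquad
  \frac{\partial x}{\partial \theta_k} \;\perp\; \nabla U(x).
\]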
[Figure: the image-processing pipeline. 7×7 patches are sampled from
60,000 images, giving 600,000 patches in 49 dimensions, each encoded
into 12 dimensions; 14×14 patches (48 dimensions, i.e., 4 × 12) are
then scanned and encoded into 12 dimensions, which are encoded once
more to produce a category label, e.g., Category: 4.]
● ∇U(x) is estimated from the data using the mean shift algorithm.
● ∇U(x) = 0 at a prototype.
● The prototypical clusters partition the space of 600,000 patches.
[Figures: the 35 prototypes, and the principal axes and geodesic
coordinate curves (ρ, θ) for Prototypes 09, 27, and 30.]
● The Logical Language:
The proposed logical language is a categorical logic based on
the category of differential manifolds (Man), which is weaker
than a logic based on the category of sets (Set) or the category
of topological spaces (Top).
For an intuitive understanding of what this means, assume that
we have replaced the standard semantics of classical logic,
based on sets and their elements, with a semantics based on
manifolds and their points. The atomic formulas can then be
interpreted as prototypical clusters, and the geometric properties
of these clusters can be propagated throughout the rest of the
language.
The same strategy can be applied to the entirety of my
Language for Legal Discourse (LLD).
[Diagram: three layers, Logic over Geometry over Probability,
linked by constraints.]
● Logic is constrained by the geometry.
● The geometric model is constrained by the probabilistic model.
● The probability measure is constrained by the data.
Conjecture: The existence of these mutual constraints makes it
possible to learn the semantics of a complex knowledge
representation language.
● Why is this a “prototypical perceptual semantics”?
● It is a prototypical semantics because it is based on a
representation of prototypical clusters.
● It is a prototypical perceptual semantics because the primary
illustrations of the theory are drawn from the field of image
processing.
● Claim: If we can build a logical language on these
foundations, we will have a plausible account of how
human cognition could be grounded in human
perception.
Can We Learn
A Grounded Semantics
Without a Perceptual Ground?
● Two reasons to think this is possible:
● The theory of differential similarity is not really sensitive to
the precise details of the representations used at the lower
levels.
● There is increasing evidence that the semantics of lexical
items can be represented, approximately, as a vector in a
high-dimensional vector space, using only the information
available in the texts.
● Research Strategy:
● We initialize our model with a word embedding computed
from legal texts.
● We learn the higher level concepts in a legal domain by
applying the theory of differential similarity.
● Discussion?
