UPDATED DRAFT FOR THE ASAIL WORKSHOP, JUNE 16, 2017. ICAIL 2017: https://nms.kcl.ac.uk/icail2017/icail2017.php

ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RELEVANCE

Kripa Rajshekhar, Metonymy Labs, Chicago, USA (kripa@metolabs.com)
Wlodek Zadrozny, Department of Computer Science, UNC Charlotte, USA (wzadrozn@uncc.edu)
Sri Sneha Varsha Garapati, Department of Computer Science, UNC Charlotte, USA (sgarapat@uncc.edu)

ABSTRACT

Recent progress in incorporating word order and semantics into the decades-old, tried-and-tested bag-of-words representation of text meaning has yielded promising results in computational text classification and analysis. This development, and the availability of a large number of legal rulings from the Patent Trial and Appeal Board (PTAB), motivated us to revisit possibilities for practical, computational models of legal relevance, starting with this narrow and approachable niche of jurisprudence. We present results from our analysis and experiments towards this goal using a corpus of approximately 8,000 rulings from the PTAB. This work makes three contributions towards the development of models for legal relevance semantics: (a) using state-of-the-art Natural Language Processing (NLP) methods, we characterize the diversity and types of semantic relationships that are implicit in practical judgements of legal relevance at the PTAB; (b) we achieve new state-of-the-art results on practical information retrieval using our customized semantic representations on this corpus; (c) we outline promising avenues for future work in the area, including preliminary evidence from human-in-the-loop interaction and new forms of text representation developed using input from over a hundred interviews with practitioners in the field. Finally, we argue that PTAB relevance is a practical and realistic baseline for performance measurement, with the desirable property of evaluating NLP improvements against "real world" legal judgement.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). © 2017 Copyright held by the owner/author(s).

Keywords: patent litigation, text analytics, semantic search, data sets

1. INTRODUCTION

Recent progress in incorporating word order and semantics into the decades-old, tried-and-tested bag-of-words representation of text meaning has yielded promising results in computational text classification and analysis. This development, and the availability of a large number of legal rulings from the PTAB (Patent Trial and Appeal Board), a special court instituted by the United States Congress as part of the America Invents Act of 2011, motivated us to revisit possibilities for practical, computational models of legal relevance, starting with this narrow and approachable niche of jurisprudence. In addition to developing practical models for legal relevance, we are motivated by the clear need of practitioners in the area of patent law for tools that more efficiently improve the quality of outcomes.
A 1970 Stanford Law Review paper [2] offered prescient remarks for the field of AI and Law and concluded with a number of potential implications, among which we noted the following: "Lawyers might rely too heavily on a restricted, and thus somewhat incompetent, system with a resulting decline in the quality of legal services". Would that remark apply to the tools available in patent law today? Recent analysis of litigation outcomes suggests that "nearly half of all patents litigated to judgment were held invalid" [1]. Furthermore, the need for more thorough research and preparation of quality patents is perhaps as strong as ever: US patent quality appears to lag that of international peers, and the US Patent and Trademark Office (USPTO) initiated its quality improvement initiative with a pilot announced on July 11, 2016.

We suggest that a computational representation of legal relevance should include a reasonably small set of computable models that capture the common modes of abductive reasoning used by practitioners exercising legal judgment within a specific area of the law. The models are considered adequate in aggregate, under some arbitrarily reducible measure of prediction accuracy, across a corpus selected from that specific area of the law.
The purpose of this paper is to outline our approach to the development and testing of several computational models for legal relevance in the narrow domain of patent law, specifically as documented through select proceedings of USPTO PTAB cases. For our tests we use a collection of Ex Parte Reexamination (EPR) patent case rulings, including the patents explicitly mentioned in decisions of the PTAB. We present results from our analysis and experiments towards this goal using a corpus of approximately 8,000 rulings from the PTAB. This work makes three contributions towards the development of models for legal relevance semantics: (a) using state-of-the-art NLP methods, we characterize the diversity and types of semantic relationships that are implicit in select judgements of legal relevance at the PTAB; (b) we achieve new state-of-the-art results on practical information retrieval tasks using our customized semantic representations on this corpus; (c) we outline promising avenues for future work in the area, including preliminary evidence from human-in-the-loop interaction and new forms of text representation developed using input from over a hundred interviews with practitioners in the field.

Finally, we note that PTAB relevance, as a baseline for performance measurement, is practical and realistic in contrast to the more popular citation-based metrics, with the desirable property of measuring technology improvements against "real world" legal judgement. For example, in contrast to recent results [8], we find that documents not in the semantic neighborhood of the query document are often still legally relevant to the query. The inadequacies of using citations were also discussed, in a different context, by researchers studying innovation [14, 17]. Together, these results illustrate the need for better data sets that reflect real legal judgement, not just citations.

The remainder of the paper is organized as follows. In Section 2 we discuss the practical motivations and practitioner requirements of prior art search. Section 3 introduces the data set. Section 4 presents our experiments, of which Subsection 4.1 gives the details and Subsection 4.2 the results. Since the experiments reveal limitations to address, the question is how to go about building better models for this purpose; this is discussed in Section 5. Conclusions (Section 6) summarize our results.

2. PRACTITIONER REQUIREMENTS

Given the nuance and complexity implicit in legal judgement, we are skeptical that a one-size-fits-all "magic-bullet" AI solution will adequately model outcomes in the field. Furthermore, comparing the current state of the art to legal information retrieval over 50 years ago [7], we observe that changes in algorithms and models of text representation have lagged far behind the dramatically improved access to data and growth in computational power. This has been noted by others, for example in discussing the inadequacies of leading search engines [8]. We believe the shortcomings are in part due to the lack of practical methods for computational modelling and representation of legal relevance.
Towards this end, we see this paper as a small part of a broader undertaking: the development of practical models of legal relevance that can be shared, added to, and built upon by practitioners and researchers alike. While this work focuses on patent law, there are synergies with work in other areas that bring domain-aware case factors into computer models [16].

While limited scholarly attention has been given to the requirements of practitioners in patent litigation and related areas, we were able to use informal interviews and the literature in the area of complex search to identify a few themes of interest. We seek to explore some of these themes further in this paper and in future work. In particular, this paper is focused on the more foundational topic of modelling legal relevance. These models are likely to be helpful in the practical work of legal professionals and in the evaluation of legal procedures across the field, improving the quality of patent grant and enforcement procedures. A model of relevance is also arguably a precondition for an explainable computational theory of semantics in the domain.

Patent cases carry substantial uncertainty [12], primarily due to the challenges implicit in knowing the entire universe of prior art before litigation commences and in reconciling the case at hand with relevant prior case law: "difficulty in knowing the relevant facts to the dispute and difficulty in knowing how a trier of fact will evaluate the facts… knowing the entire universe of prior art is impossible before litigation commences" [12].
We note that the typical litigation workflow is accompanied by a diverse set of requirements at different phases of the process: for instance, exploration of case law and the technology landscape at the outset of a case, followed by an analysis of semantically and contextually linked outcomes relevant to the matter at hand, and then assistance in selecting and narrowing in on more specific artifacts (for example, highly relevant patents) to be used in the preparation for potential litigation. We restrict ourselves to the last step of the patent litigation workflow.

Identifying highly relevant prior art is the particular use case of interest in this paper. After a patent is granted, its validity can be challenged in litigation or in several post-grant proceedings. In the majority of these challenges, it is necessary to find and examine a number of documents from a potentially very large pool of patent and technical literature; that is, to establish the relationship of the invention to the prior art. For the purpose of this paper, we do not need to get into the legal differences between the different types of proceedings (for an overview, see http://www.pillsburylaw.com/post-grant-proceedings or http://fishpostgrant.com/post-grant-review/, as well as https://en.wikipedia.org/wiki/Patent_Trial_and_Appeal_Board and http://www.uspto.gov/patents-application-process/patent-trial-and-appeal-board-0). Nor do we need to attend to the differences between patent jurisdictions, because the technical problems of text analytics and information retrieval are the same.

Finding references potentially invalidating a patent is perhaps more challenging than finding (some) relevant prior art. For example, the average number of cited references in a patent is about 40 (see http://patentlyo.com/patent/2015/08/citing-references-alternative.html), while the number cited in invalidation decisions is usually less than 5. Arguably, any patent search supporting invalidation has to be very precise. Finding such relevant documents is non-trivial, because many documents refer to the same concepts that describe the invention at hand, and these documents can appear in multiple patent classes and in the broad scientific and technical literature. Moreover, similar concepts, relations, and functionalities might be expressed in different words, so keyword search is not sufficient to find all relevant documents. This search process is therefore labor intensive, costly, and possibly error prone, even with the support of modern information retrieval tools. Analyzing a collection of patents and related product or scientific literature is also costly, mostly because it takes time and requires a highly trained workforce (lawyers and domain experts). Importantly from our perspective, there are few analytic tools that can support this process. Most patent analytics tools analyze metadata (e.g. https://lexmachina.com/legal-analytics/), for example estimating the probability of a patent being found invalid based on statistics on trial location, examination art unit, etc. Allison et al. [1] provide an in-depth analysis of the "Realities of Modern Patent Litigation", relating "the outcomes (…) to a host of variables, including variables related to the parties, the patents, and the courts".
Our goal as technology developers is to improve patent analytics tools; our goal as researchers is to understand the obstacles on this path and to find ways of avoiding them. We note that legal reasoning is abductive, since the models implicit in particular cases are individually neither necessary nor sufficient to explain all cases, but rather are good enough to model outcomes in only some reasonable sample of cases. For instance, our analysis shows that aggregate document-level semantic relatedness is an adequate mode of reasoning in only a small minority of USPTO Ex Parte Reexamination (EPR) cases. Manual examination of cases with variance demonstrates that, while relevant terms and semantic links are present, a high frequency of related word and phrase occurrence is neither a necessary nor a sufficient condition for legal relevance. Before we discuss the models, let us say a few words about the data we use to test them.

3. PATENT TRIAL AND APPEAL BOARD (PTAB) DATA SETS

Post grant review and Inter Partes Review (IPR) are conducted at the USPTO Patent Trial and Appeal Board (PTAB) and are aimed at reviewing the patentability of one or more claims in a patent. A review begins with a third-party petition, to which the patent owner may respond. A post grant review is instituted if it is more likely than not that at least one challenged claim is unpatentable. If the petition is not dismissed, the Board issues a final decision within 1 to 1.5 years (see http://www.uspto.gov/patents-application-process/appealing-patent-decisions/trials/post-grant-review).
Chien and Helmers [4] discuss "Inter Partes Review and the Design of Post-Grant Patent Reviews" processes and key statistics, including the statistics of case dispositions. The USPTO notes that 80% of IPR reviews end with some or all claims invalidated (http://www.uspto.gov/patents-application-process/patent-trial-and-appeal-board/statistics).

What is in the PTAB data? The publicly available Patent Trial and Appeal Board (PTAB) dataset, as of January 2017, comprises about 100 zip files containing 10 GB of data (compressed), available at https://bulkdata.uspto.gov/data2/patent/trial/appeal/board/. These files are either image or text .pdf files with PTAB decisions. Each decision pertains to the validity of the claims of one patent.

Why care about PTAB data? Because each case has a relatively small collection of highly relevant documents used as evidence. The outcomes are clear and the reasoning can be modeled. There is enough data for statistical inference (although perhaps not enough to train a neural net from scratch). Also, as mentioned earlier, the PTAB data represent practitioner needs better than the more commonly used citation graphs. In this paper we report on some initial experiments on the PTAB data sets. Given the relatively structured form of the data available and the more streamlined process used in adjudication, we believe that PTAB data constitute a unique training corpus for developing and improving customized tools used in the areas of patent litigation and licensing. As we discuss later, it serves as a better measure of satisfying practitioner needs than citation retrieval.

4. EXPERIMENTS AND RESULTS FROM SEMANTIC ANALYSIS OF PTAB RULINGS

Encouraged by the recent development of neural language model representations [11], and by the availability of a rich corpus of documents capturing relevance judgements in patent law, we sought to explore the extent to which a computational theory of semantic relevance in this area of law is possible. As described in the section on practitioner requirements, such a theory would be of great utility to practitioners and policy makers in this area of law. Our experimental approach is therefore both theoretically and practically motivated, and empirical, with an emphasis on exploring the possibilities and limits of such a theory. For the experiments, we use a sample of 8,000 EPR rulings from the USPTO Final Decisions of the PTAB. Our experiments use subsets of the data to (i) perform an analysis of relationships between the pairs of patents associated in the rulings, (ii) assess the impact that types of semantic representations have on the practical task of patent retrieval for legal relevance, and (iii) empirically explore possibilities for alternate forms of text representation to model legal relevance that enable effective human-in-the-loop interaction.

4.1. Details of the experiments

Our tests consist in using different techniques to retrieve the patents cited in PTAB decisions, based on queries built only from the patent whose validity is being questioned, with no additional case-specific practitioner input. Such queries typically consist of combinations of the patent abstract, its title, or its first claim.
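To make the setup concrete, the following is a minimal sketch of this retrieval task and of the Recall @ k measure used throughout this section. It is not the system used in our experiments: the TF-IDF bag-of-words index (via scikit-learn) stands in for the representations compared below, and the corpus variables and query field names are hypothetical placeholders.

```python
# A minimal sketch of the retrieval task and Recall@k scoring, assuming a
# scikit-learn TF-IDF index as a stand-in for the representations compared
# in the experiments. `corpus_texts`, `corpus_ids`, and the query fields
# are hypothetical placeholders, not the actual pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recall_at_k(corpus_texts, corpus_ids, query_patent, cited_ids, k=100):
    """Return 1 if any patent cited in the PTAB ruling ranks in the top k."""
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(corpus_texts)
    # The query is built only from the subject patent itself:
    # title, abstract, and first claim, with no practitioner input.
    query = " ".join([query_patent["title"],
                      query_patent["abstract"],
                      query_patent["claim_1"]])
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    top_k = {corpus_ids[i] for i in scores.argsort()[::-1][:k]}
    return int(any(pid in top_k for pid in cited_ids))
```

Averaged over a test sample of rulings, this indicator yields the Recall @ k percentages (e.g. Recall @ 100, Recall @ 1000) reported below.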
As baselines for our evaluations we used both bag-of-words (BOW) query representations and semantic search implemented using conceptual expansion of query words. The conceptual expansion was implemented using Wikipedia-derived related concepts, comparable to well-understood approaches [8, 13].

Experiment 1. To evaluate the hypothesis that aggregate document-level semantic relatedness is an important factor in a potential model for legal relevance in the patent domain, we attempted to quantify the correlation between semantic relatedness and patent relevance using a sample of 245 semiconductor EPR cases. In this sample, Recall at 1000 was 30%: in 30% of the cases, a patent from the ruling was found within the top 1000 results retrieved using the case subject patent as the query. The point is that this measure of semantic similarity is inadequate to capture PTAB relevance: for only 30% of the subject patents did the 1000th-ranked patent document returned by our state-of-the-art semantic-relatedness model score lower than the PTAB-relevant patent under the well-understood cosine-similarity measure of relatedness. This drops to 15% when the 100th document is considered. This result was not surprising to the lawyers and practitioners we interviewed, who have long insisted that finding relevant patents is more art than science. However, these results also provide an interesting contrast to the assumptions of other researchers such as Khoury and Bekkerman [8], who suggest that "if a given document is not in the semantic neighborhood of the query document, it simply cannot be relevant for the query document".
Our work challenges, with experimental results, the understandable intuition of many information retrieval researchers that relevant prior art must necessarily be found in the set of documents that have a high degree of semantic similarity as measured by state-of-the-art text processing methods. We observe that this insight would have been unavailable without the PTAB data, since prior work does not use legal judgement to model relevance. This experiment demonstrates the need for other methods of association, a greater variety of semantic connections, and perhaps more sophisticated goal- or argument-oriented interpretation of the claim language.

Experiment 2. To quantify the improvement possible through the use of better semantic representations, we benchmarked a Word2Vec (W2V) [11] model trained on pre-processed text, grouped by patent subsector. Specifically, we classify each patent into one of 37 industry groupings (e.g. Computer Hardware & Software, Metalworking); the groups correspond to the standard NBER subcategories. The claim text for each patent was then modified to include references to special words that uniquely identified each patent as a new word in our vocabulary (e.g. _6435262_ for patent US 6435262 B1). Cross references, including citations, between claims were then tagged at the claim level with the relevant unique patent identifiers, to improve the locality-sensitive mapping between a patent and the various claims that relate to it. The training of the word vector models was then carried out individually for each of the 37 preprocessed text corpora, providing us with a W2V model corresponding to each of the industry groups (skip-gram training, with 200-dimensional vector representations and a minimum word count of 4). Given the trained W2V models, a simple semantic retrieval task then amounts to finding the closest patent identifier word (treated as a special word in the vocabulary) to the identifier for a patent of interest. Proximity in our case was measured with the commonly used cosine similarity measure. This measure, or relative ranking, could be further improved upon with additional semantically important representations to more closely model the type of relevance desired; in our case, the relevance of two patent documents based on PTAB guidelines. We make some suggestions along these lines in our experiments on human-in-the-loop emulation.

We used 1500 PTAB pairs in this test. Using bag-of-words with conceptual query expansion resulted in a 4.9% sample match for Recall @ 100 and was indistinguishable from using simple bag-of-words (BOW), so either could constitute a baseline. However, using the subsector-specific model resulted in a significant improvement: Recall @ 100 of 19%. This increase in performance is primarily driven by the more accurate modelling of industry-sector-specific relationships between the patents and of language use (e.g. "system" in Computer Hardware & Software means something quite different from "system" in Mining).
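The sketch below illustrates this setup under the stated parameters (skip-gram, 200 dimensions, minimum count 4), assuming gensim 4.x. The tokenizer and the corpus iteration are simplified placeholders, and the function names are ours, not taken from our actual implementation.

```python
# A minimal sketch of the Experiment 2 representation: one skip-gram
# Word2Vec model per NBER industry group, trained on claim text in which
# each patent also appears as a unique pseudo-word (e.g. _6435262_), so
# that retrieval reduces to nearest-neighbor search among identifier
# tokens. Assumes gensim 4.x; tokenization is deliberately simplistic.
from gensim.models import Word2Vec

def claim_sentence(patent_id, claim_text, referenced_ids):
    """One training 'sentence': claim tokens plus identifier pseudo-words
    for the patent itself and any patents the claim cross-references."""
    tokens = claim_text.lower().split()
    return [f"_{patent_id}_"] + tokens + [f"_{pid}_" for pid in referenced_ids]

def train_subsector_model(sentences):
    # Parameters as stated above: skip-gram (sg=1), 200 dimensions, min count 4.
    return Word2Vec(sentences, sg=1, vector_size=200, min_count=4, workers=4)

def retrieve(model, patent_id, k=100):
    """Rank other patents' identifier tokens by cosine similarity."""
    neighbors = model.wv.most_similar(f"_{patent_id}_", topn=10 * k)
    return [w for w, _ in neighbors
            if w.startswith("_") and w.endswith("_")][:k]
```

Because the identifier pseudo-words live in the same vector space as ordinary claim vocabulary, nearest-neighbor lookup over identifiers directly implements patent-to-patent retrieval within each industry group.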
Experiment 3. In this experiment we attempted to quantify the relative impact of elementary human-in-the-loop intervention on retrieval performance. We observed instances where simple reranking of search results, based on user feedback on positive/negative document examples, allows a matching document that was ranked below 5000 to be retrieved in the top 100 in one step of user feedback. One example is disambiguating the erroneous sense in which the acronym "ATM" was used: the Asynchronous Transfer Mode telecommunication network technology, in contrast to the intended payment terminal (Automatic Teller Machine) sense of the term. Since both technologies reference terms like "network", "communication", and "interface", this differentiation was not sufficiently available in the unsupervised W2V models.

To further test this intuition, we attempted to emulate the action of a user applying simple heuristics to improve the results, by eliminating groups of retrieved patents that, on simple visual inspection, are unlikely to be relevant matches. We measure the impact of such intervention as the improvement in recall performance. For example, we show that Recall @ 200 without intervention is approximately 10% but increases to approximately 15% using a simple human-in-the-loop-like heuristic intervention. Specifically, using 90 PTAB patent pairs, we attempted to emulate human-in-the-loop behavior using a coarse method of additional screening. While the numbers are small, they indicate the potential for improvement in recall performance using comparable human-in-the-loop feedback that relies on actual user judgement (versus the emulated approach in our experiment). The specific filters we used, sketched in code below, are: the patent pairs considered must be of the same Type (i.e. 'device', 'method', 'system', or 'other', using the corresponding keywords in the first claim); in addition, their Aboutness (represented by the first noun, adjective, and verb in the same claim) and Verb Signature (the most frequent verbs in the same claim) must share at least one word with the patent that is the focus of the PTAB decision. We dropped the other results that did not meet the criteria. The rest of the result frame (the top 100 ranked retrieved patents) was filled with other top semantically sorted search results. In another test, we also considered Claim 1 length, dropping patents with a first claim longer than 200 words or shorter than 10.
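A sketch of these screening heuristics follows. The POS tagger (NLTK here), the number of signature verbs, and the exact combination of the Type, Aboutness, and Verb Signature tests are our assumptions for illustration, not a description of the exact filters used.

```python
# A minimal sketch of the Experiment 3 emulated intervention: filter
# retrieved candidates by claim-1 Type, Aboutness (first noun, adjective,
# and verb), Verb Signature (most frequent verbs), and claim length.
# NLTK tagging is an assumption (requires the punkt and
# averaged_perceptron_tagger resources); any POS tagger would do.
from collections import Counter
import nltk

TYPE_KEYWORDS = ("device", "method", "system")

def claim_features(claim_text):
    tokens = nltk.word_tokenize(claim_text.lower())
    tagged = nltk.pos_tag(tokens)
    ctype = next((t for t in TYPE_KEYWORDS if t in tokens), "other")
    def first(tag_prefix):
        return next((w for w, t in tagged if t.startswith(tag_prefix)), None)
    aboutness = {first("NN"), first("JJ"), first("VB")} - {None}
    verb_counts = Counter(w for w, t in tagged if t.startswith("VB"))
    verb_signature = {w for w, _ in verb_counts.most_common(3)}  # assumed top 3
    return ctype, aboutness, verb_signature

def keep_candidate(subject_claim, candidate_claim, min_len=10, max_len=200):
    """Drop candidates a user would discard on quick visual inspection."""
    s_type, s_about, s_verbs = claim_features(subject_claim)
    c_type, c_about, c_verbs = claim_features(candidate_claim)
    length_ok = min_len <= len(candidate_claim.split()) <= max_len
    return (c_type == s_type
            and bool(s_about & c_about or s_verbs & c_verbs)
            and length_ok)
```

As described above, candidates failing such tests are dropped and the top-100 result frame is refilled with the next best semantically ranked results.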
Operating on retrieved results using the Type, Aboutness, and Verb Signature features as filters had significant scope, as measured in terms of retrieved results impacted.

Figure 1: Experiment 3, results from human-in-the-loop emulation. The blue line is the recall baseline using semantic search. The red line shows adjustments based on type, 'aboutness' of the claim, and verb pattern, together with a length-of-Claim-1 filter (dropping very short and very long claims). This experiment was performed using 90 patent pairs derived from PTAB data. The convergence of the lines at small and large recall values suggests that proper calibration of human-in-the-loop tools will be crucial.

It is worth noting that small changes in filters often impact thousands of results at a time. For example, the top 20 most frequent Aboutness and Verb Signature words could, through 'OR' operations, span tens of thousands of results. This is another argument for a human-in-the-loop approach.

Experiment 4. To evaluate other forms of representation that allow more granular, but human-understandable, control of results, we explored a simple set-of-words model of claim language to augment the human-in-the-loop methods described above. For this experiment, patent claims were processed into phrase chunks: unordered word sets of length 1-10. Each patent typically has 50-75 such unique word sets, and about 50% of these chunks were unigrams. The relatedness of two patents could then be controlled with easier-to-intuit user input, expressed as chunk (word set) inclusion or exclusion. Our experiments showed that this representation has discriminating power and could be a candidate for further human-in-the-loop experimentation; a sketch of the representation follows.
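The sketch below is a minimal version of this word-set representation and overlap measure. The chunking rule (splitting on punctuation and removing stop words) and the reading of the overlap measure as the fraction of the subject's word sets that intersect some candidate word set are our assumptions for illustration.

```python
# A minimal sketch of the Experiment 4 set-of-words representation:
# text is reduced to unordered word sets ("phrase chunks") of length
# 1-10, and relatedness is the fraction of the subject's word sets
# that intersect at least one of the candidate's word sets. The
# chunking rule and stop-word list are simplified assumptions.
import re

STOP_WORDS = {"a", "an", "the", "of", "and", "or", "to", "in", "for",
              "said", "wherein", "comprising"}

def word_sets(text, max_len=10):
    """Split text into chunks at punctuation; keep each as a word set."""
    sets = set()
    for chunk in re.split(r"[,;:.()]", text.lower()):
        words = frozenset(w for w in chunk.split() if w not in STOP_WORDS)
        if 1 <= len(words) <= max_len:
            sets.add(words)
    return sets

def overlap_ratio(subject_sets, candidate_sets):
    """Count of intersecting subject word sets over the subject total."""
    if not subject_sets:
        return 0.0
    hits = sum(1 for s in subject_sets
               if any(s & c for c in candidate_sets))
    return hits / len(subject_sets)
```

Under this reading, the 10% threshold discussed below corresponds to an overlap_ratio above 0.10, and user feedback amounts to adding or removing individual word sets before re-scoring.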
Using this representation of the abstracts of 100 PTAB pairs, the charts in Figure 2(a) compare the count of intersecting word sets divided by the number of the subject patent's word sets. The LHS chart in Figure 2(a) is for the actual relevant pairs; the RHS chart is for the same subject patents paired with patents randomly selected from the approximately 300 patents in the set. We note that a set-intersect measure above 10% correlates with patent relevance in 80% of the cases (the remaining 20% could be cases where claim language, the detailed specification, or other features drove the relevance match even though the abstract language did not have this set-intersect match). We also note that a 10-40% set intersect accounts for a large majority of the matching pairs.

Figure 2(a): Histograms of word-set overlap comparing PTAB relevant pairs (LHS) vs. random pairs (RHS), illustrating the extent to which the degree of word-set overlap is correlated with relevance.

To evaluate the ability of this form of representation to discriminate between semantically related but not legally relevant patents, the LHS chart in Figure 2(b) shows the maximum word-set intersection for 100 PTAB subject patents and the top 50 semantically related patents returned by our best-performing subsector-trained Word2Vec model; the RHS chart in Figure 2(b) shows the random overlap from Figure 2(a) for comparison. We note that a majority of the top 50 semantically similar documents pass the word-set intersect threshold for a "match", as evaluated against the measurement from relevant pairs versus randomly selected pairs.
Since semantic similarity generally coincides with bag-of-words similarity, this is an unsurprising result. We expect that multiple candidates will match based on the set-overlap representation but be readily amenable to human feedback, given the ease with which a user can examine and filter the underlying word-set representation. From our interviews with practitioners in the field, we anticipate that this type of representation, which allows intuitive human intervention, could offer a convenient way to dynamically adjust the search landscape and reveal better quality candidates for human intuition.

Figure 2(b): Histograms of word-set overlap between PTAB subject patents and their top-50 semantically ranked patents (LHS) vs. random pairs (RHS), illustrating the higher prevalence of word-set overlap among semantically related examples, in contrast to the PTAB pairs of Figure 2(a).

4.2. Results

We report three main results: the first two concern the viability of a semantic representation in the domain that yields practically useful results; the third concerns representation and human-in-the-loop retrieval methodology, which we see as having the greatest promise and as suggesting a path for future development. We observe:

Result 1. Fewer than 15% of patents judged as relevant according to a PTAB ruling appear to have stronger aggregate semantic relatedness with each other than with other patents that merely share word or topic associations. For instance, two patents responsive to the topic "foundry", both dealing with metal forming, may not be highly related to each other (in the sense of contributing to patent invalidation), although they may be a strong semantic match. However, a patent that deals with slurrying and foundry processes, which is related but not a strong semantic conceptual match, does indeed invalidate one of the patents in another test case.

Result 2. Improving patent text representation through preprocessing steps, word embeddings of patent-specific keywords, and distinct subsector-specific training improves recall performance four-fold, from 5% to 20%. As a performance measurement baseline, we used BOW modified with a concept representation along the lines of [8] and [13]. We focused on the information retrieval (IR) task of finding patents cited in PTAB documents, where 5% of the test sample recalled the relevant document in the top 100 results. Using our customized subsector-specific semantic model representation, we were able to retrieve the relevant document in the top 100 results in 20% of the cases. This result promises future improvement from methods that model semantic relatedness in more granular ways, down from subsector-specific to perhaps patent-type-specific (device, method, apparatus) modelling of relevance.

Result 3. Consistent with improvements demonstrated by others, we have preliminary evidence from our experiments with human-in-the-loop retrieval that points to substantial performance improvement in particular cases. To improve methods of incorporating user feedback, we have also evaluated simple forms of patent language representation that approximate the heuristics practitioners rely on in their search strategies, such as term overlap, sector information, 'what the patent is about', and Claim 1 length.
We show that a single iteration of such filtering can improve performance by 50%, as measured by Recall @ 200. However, at other recall values (both smaller and larger) the improvements are lower, so the proper calibration of human-in-the-loop tools will be crucial.

5. DISCUSSION

In this section we comment on our use of distributional approaches versus traditional NLP techniques, as well as the advantages of the human-in-the-loop approach. We also add some comments on the possibility of computational models of legal relevance.

5.1. Why only use distributional approaches?

One possible objection to the distributional approaches is that they do not capture the deeper semantic meaning of patent texts or claims. However, the current state of natural language processing suggests we need new tools to capture finer distinctions in meaning, beyond the standard words-to-syntax-to-semantics pipeline.
Experiments with parsing patent claims seem to show that none of the existing tools can do it. This is not surprising. The average length of Claim 1 is between 150 and 200 words, when measured in a few weekly samples of US patents in 2016. In addition, more than 90% of first claims are longer than 50 words. By contrast, the average sentence length in the Wall Street Journal corpus is 19.3 words [15], with lengths commonly ranging from 3 to 20 words, and most natural language parsers are trained on similar corpora. We also know that parsing accuracy decreases with sentence length. For example, McDonald and Nivre [10, Fig. 2] show that parsing accuracy drops by 10 points or more for sentences of 40 or more words, and similar results appear elsewhere [5]. Even worse, Boullier and Sagot [3] entertain the possibility that "(full) parsing of long sentences would be intractable" (long meaning more than 100 words). This means that an analysis of the structure of an average, crucial Claim 1 is likely to be wrong until we create better sentence analysis tools.
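For concreteness, the measurement behind these claim-length figures is straightforward; a minimal sketch follows, where `first_claims` is a hypothetical list of claim-1 texts drawn from weekly USPTO full-text samples (loading code omitted).

```python
# A minimal sketch of the claim-length measurement cited above.
# `first_claims` is a hypothetical list of claim-1 strings sampled
# from weekly US patent releases; loading code is omitted.
def claim_length_stats(first_claims):
    lengths = [len(claim.split()) for claim in first_claims]
    average = sum(lengths) / len(lengths)
    share_over_50 = sum(1 for n in lengths if n > 50) / len(lengths)
    return average, share_over_50  # e.g. ~150-200 words, share > 0.90
```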
5.2. Additional comments on human in the loop

The results of our human-in-the-loop patent retrieval methods are promising. They are also consistent with results from search query modification experiments [6], where "baseline performance can be doubled if only one relevant document was manually provided by the user". Similarly, the IBM Watson system advocates an interactive approach to symptom classification [9]. As we show, simple human-in-the-loop intervention, by means of filtering results using heuristically selected word features (e.g. type of patent, earliest appearing noun-adjective-verb sequence), can modify rankings significantly, with preliminary evidence illustrating the potential for over 50x improvement, where a retrieval task failed Recall @ 5000 but passed Recall @ 100 with user feedback. Simple methods, such as a three-word summary or a patent's aboutness, prove to be a productive way for the user to include or exclude groups of patents or claim language, closely mimicking the skimming of the detailed text performed by the expert human reader in practice. These improvements in practically inspired forms of patent text and semantics representation are likely to require language models tuned to the specific nuances of the text in this narrow domain; we need more of these. Furthermore, through over a hundred interviews with industry practitioners conducted to understand their expectations of search tools and common practices, we recognize the importance of alternative forms of representation, essential to support the different points of view implicit in judging relevance. Again, we are planning to experiment with other representations.

5.3. Towards a computational theory of legal relevance

Incorporating word order and semantics into text representation is a recent win for the field of NLP [11]. Based on the results of our experiments and on interviews with practitioners, we believe that a one-size-fits-all semantic search approach utilizing these advancements is incapable of capturing the nuanced relevance judgements made in the domain of patent litigation. We demonstrated orders-of-magnitude improvement in practically relevant task performance through modification of semantic representation models: preprocessing of text, customized forms of relevance representation for claims (such as their aboutness), and the use of human-in-the-loop feedback to better curtail the possibilities from a list of semantically relevant documents. We observed that, given the abductive nature of legal relevance, more foundational work on representation, as well as a descriptive taxonomy of the patterns of relevance expected by legal practitioners, is likely to yield better-performing semantic representations. This will require further engagement with the legal community to ensure that the computational work and machine learning protocols are guided by the specific intuitions and areas of focus of practitioners. The benefit of this work is likely to be two-fold: (i) better performance of patent search and analytics systems that could support the quality and efficiency of work in the field, and (ii) a more in-depth understanding of the implicit criteria embodied in the legal record of decades of judgements and rulings, which could serve as a valuable learning and policy tool for the field at large.

6. SUMMARY AND CONCLUSIONS

In this paper we introduced a new data set relevant for patent retrieval and, more generally, for modeling legal relevance: a collection of rulings from the USPTO Patent Trial and Appeal Board (PTAB). We have used eight thousand documents from this data set to perform a collection of experiments. These experiments show the need for new models of relevance.
We presented and evaluated a number of such models based on distributional and structural features of patent data. We also argued that we need a new collection of approaches to computational modeling of relevance, and that the most promising avenues of research will have to include an interactive, human-in-the-loop approach. In addition, using the PTAB data set for testing relevance in patent document retrieval, instead of traditional citation search, reveals a bigger gap between the needs of practitioners and the capabilities of current information retrieval and NLP technologies.

Following from our conclusions, we believe future work in this area will include three major streams: (i) the hypothesizing and testing of semantic representations that allow for automated classification of historically observed modes of legal relevance (e.g. do PTAB rulings where patents were found to be relevant due to individual claim-level semantic relatedness explain more cases than document-level semantic relatedness?); (ii) incorporating novel human-in-the-loop feedback methods and then back-testing their performance in practically valuable litigation scenarios, such as PTAB IPR datasets; and (iii) input from legal theorists and practitioners on the accurate classification of modes of relevance judgements (e.g. an enumeration of the types and forms of arguments typically used in support of legal relevance judgements).

REFERENCES

[1] Allison, J.R., M.A. Lemley, and D.L. Schwartz (2014). Understanding the Realities of Modern Patent Litigation. Texas Law Review 1769 (2014). Available at SSRN: http://ssrn.com/abstract=2442451

[2] Buchanan, B., and T. Headrick (1970). Some Speculation About Artificial Intelligence and Legal Reasoning. Stanford Law Review 23(1), November 1970.

[3] Boullier, P., and B. Sagot (2005). Efficient and Robust LFG Parsing: SXLFG. Proceedings of the Ninth International Workshop on Parsing Technologies (IWPT). http://www.aclweb.org/anthology/W05-1501

[4] Chien, C.V., and C. Helmers (2015). Inter Partes Review and the Design of Post-Grant Patent Reviews. Santa Clara Univ. Legal Studies Research Paper No. 10-15. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2601562

[5] Choi, J.D., J. Tetreault, and A. Stent (2015). It Depends: Dependency Parser Comparison Using a Web-based Evaluation Tool. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. http://www.aclweb.org/anthology/P/P15/P15-1038.pdf

[6] Golestan Far, M., S. Sanner, et al. (2015). On Term Selection Techniques for Patent Prior Art Search. Proceedings of the 38th International ACM SIGIR Conference.

[7] Horty, J. (1962). The "Key Words in Combination" Approach. MULL: Modern Uses of Logic in Law 3(1), March 1962, pp. 54-64.

[8] Khoury, A., and R. Bekkerman (2016). Automatic Discovery of Prior Art: Big Data to the Rescue of the Patent System. The John Marshall Review of Intellectual Property Law 16(1):44.

[9] Lally, A., et al. (2014). WatsonPaths: Scenario-based Question Answering and Inference over Unstructured Information. IBM Research Report RC25489 (WAT1409-048), September 17, 2014.

[10] McDonald, R., and J. Nivre (2007). Characterizing the Errors of Data-Driven Dependency Parsing Models. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). http://www.aclweb.org/anthology/D07-1013
[11] Mikolov, T., et al. (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems.

[12] Schwartz, D.L. (2012). The Rise of Contingent Fee Representation in Patent Litigation. Alabama Law Review 335 (2012). Available at SSRN: http://ssrn.com/abstract=1990651

[13] Shalaby, W., and W. Zadrozny (2015). Measuring Semantic Relatedness using Mined Semantic Analysis. arXiv preprint arXiv:1512.03465.

[14] Strumsky, D., and J. Lobo (2015). Identifying the Sources of Technological Novelty in the Process of Invention. Research Policy 44(8).

[15] Strzalkowski, T. (ed.) (1999). Natural Language Information Retrieval. Springer.

[16] Wyner, A., and W. Peters (2010). Lexical Semantics and Expert Legal Knowledge Towards the Identification of Legal Case Factors. Proceedings of JURIX 2010, pp. 127-136.

[17] Youn, H., et al. (2015). Invention as a Combinatorial Process: Evidence from US Patents. Journal of The Royal Society Interface 12(106):20150272.
