Linguistic processing techniques like morphological analysis and use of ontologies can improve recall for document characterization in legal discovery by expanding search queries. Semantic analysis of documents and queries can improve precision of searches by returning only documents that precisely match the intended relationships between entities. Linguistic processing can also aid redaction of sensitive information by better detecting entities and relations. While more computationally intensive than keyword searches, these techniques can scale to large document collections through two-stage processing and creation of semantically indexed resources.
Enhancing Legal Discovery with Linguistic Processing
Daniel G. Bobrow, Tracy H. King, and Lawrence C. Lee
Palo Alto Research Center Inc.
www.parc.com/nlp
Introduction
The U.S. Federal Rules of Civil Procedure have greatly increased the importance of understanding the content within very large collections of electronically stored information. Traditional search methods using full-text indexing and Boolean keyword queries are often inadequate for e-discovery: they typically return too many results (low precision) or require tightly defined queries that miss critical documents (low recall). Linguistic processing offers a way to increase both the precision and recall of e-discovery applications. We discuss four issues in legal discovery that can be enhanced with linguistic processing: improving recall for characterization, improving precision for search, protecting sensitive information, and scaling to large collections.
Characterization: Recall
Especially in the initial stages of trial preparation, attorneys need to be able to retrieve all of the information in a collection that is relevant to some characterization of interest. These characterizations depend on the legal strategy, and so must be formulated quickly and flexibly. The most natural way to describe such content is in natural language, not in heavily formalized regular expression languages. Linguistic processing on the query can help generate rules in a higher-level language much closer to natural language.
Two basic linguistic tools that aid query generation for characterization are morphological analysis and ontological information. For example, morphological analysis of the term 'buy' in a query will produce 'buy', 'buying', and 'bought'. The more abbreviated and elliptical texts found in email can be treated similarly: common email abbreviations like 'mtg' can be run through a type of morphological analysis to match against 'mtg', 'meeting', and 'meetings'. Using a disjunction of all these forms in the search increases recall, returning both more relevant documents and more passages with examples from which to produce novel queries. Ontologies, both domain-specific and general, automatically produce synonyms ('buy' = 'purchase') and hypernyms (a boy is a type of child, which is a type of human) that can be used to expand the sample query into alternatives, again allowing for greater recall in the initial stages of the characterization task. During this initial step, where recall is important and the entire information collection is being culled, linguistic processing is applied only to the queries, while the search over the information is done with more standard search techniques. This allows massive information collections to be processed rapidly and thoroughly.
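As an illustration, the sketch below expands a query term with inflectional variants, email abbreviations, and WordNet synonyms and hypernyms, then hands the resulting disjunction to an ordinary keyword engine. It assumes NLTK's WordNet data is installed; the inflection table, abbreviation map, and function names are illustrative stand-ins rather than the system described in this paper.

```python
# A minimal sketch of query-term expansion for the characterization stage,
# assuming NLTK's WordNet data is available (nltk.download('wordnet')).
# The inflection table and abbreviation map are hand-written stand-ins for a
# real morphological analyzer, not PARC's implementation.
from nltk.corpus import wordnet as wn

INFLECTIONS = {"buy": ["buys", "buying", "bought"]}          # stand-in for morphological generation
EMAIL_ABBREVIATIONS = {"mtg": ["meeting", "meetings"]}       # stand-in for an abbreviation lexicon

def expand_term(term: str) -> set[str]:
    """Collect surface forms, synonyms, and hypernyms for one query term."""
    forms = {term}
    forms.update(INFLECTIONS.get(term, []))
    forms.update(EMAIL_ABBREVIATIONS.get(term, []))
    for synset in wn.synsets(term):
        forms.update(l.name().replace("_", " ") for l in synset.lemmas())
        for hypernym in synset.hypernyms():
            forms.update(l.name().replace("_", " ") for l in hypernym.lemmas())
    return forms

def to_disjunctive_query(terms: list[str]) -> str:
    """Turn each term's expansions into a disjunction for a standard keyword engine."""
    clauses = ["(" + " OR ".join(sorted(expand_term(t))) + ")" for t in terms]
    return " AND ".join(clauses)

# Example: to_disjunctive_query(["buy"]) yields a clause containing forms such as
# 'buy', 'bought', 'purchase', 'acquire', which is then run as a Boolean query
# against the full-text index.
```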
Search: Precision
An important aspect of legal discovery is finding information that answers specific questions or that says specific things. By automatically processing the texts into more normalized, deep semantic structures and then indexing these structures into a large database optimized for semantic search, queries over the information collection can be made in natural language. These linguistic structures normalize away the vagaries of natural language sentences, encoding the underlying meaning. At the simplest level, surface forms of words are stemmed to their dictionary entry and synonyms and hypernyms are inserted. However, the linguistic processing can go much deeper, normalizing different syntactic constructions so that expressions which mean the same thing have the same linguistic structure. As a simple example, 'Mr. Smith bought 4000 shares of common stock.' and '4000 shares of common stock were bought by Mr. Smith' are mapped to the same structure and indexed identically. The resulting semantically based index thus stores a normalized but highly detailed version of the content and includes links back to the original passages.
The queries against the information collection are similarly automatically processed into semantic representations at query time, and these semantic representations are used to query the database for relevant documents. Unlike more standard search techniques, using the deeper semantic structures allows for greater precision and hence fewer irrelevant documents to review. The linguistic structures encode the relations between entities and actions (e.g., who did what when) so that only documents describing entities in the desired relations are retrieved. For example, standard search techniques would retrieve both 'X hit Y' and 'Y hit X' from a search on the entities X and Y and the 'hit' relation, since all of the relevant items are mentioned. However, when searching for evidence in a massive information collection, it is important to return only the text passages which refer to the intended relationship among the entities.
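To make the active/passive example concrete, the following sketch uses spaCy's dependency parse to collapse both surface forms onto one (agent, predicate, patient) triple before indexing. spaCy and its label set are assumptions made here for illustration; the system described in this paper relies on a deeper semantic analysis than this.

```python
# A minimal sketch of normalizing active and passive sentences to the same
# (agent, predicate, patient) triple. spaCy's dependency labels stand in for a
# deeper semantic analysis; the triple format and model choice are assumptions.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires the small English model to be installed

def normalized_triple(sentence: str):
    """Return (agent_lemma, predicate_lemma, patient_lemma), or None if no triple is found."""
    doc = nlp(sentence)
    for token in doc:
        if token.pos_ != "VERB":
            continue
        agent = patient = None
        for child in token.children:
            if child.dep_ == "nsubj":            # active subject -> agent
                agent = child
            elif child.dep_ == "dobj":           # active object -> patient
                patient = child
            elif child.dep_ == "nsubjpass":      # passive subject -> patient
                patient = child
            elif child.dep_ == "agent":          # passive 'by' phrase holds the agent
                agent = next((g for g in child.children if g.dep_ == "pobj"), agent)
        if agent is not None and patient is not None:
            return (agent.lemma_.lower(), token.lemma_.lower(), patient.lemma_.lower())
    return None

# Both surface forms should collapse to the same index entry:
#   normalized_triple("Mr. Smith bought 4000 shares of common stock.")
#   normalized_triple("4000 shares of common stock were bought by Mr. Smith.")
# each yield roughly ("smith", "buy", "share"), though parses vary by model version.
```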
Redaction
E-discovery increases in complexity when issues of confidentiality are considered. Over the past several years we have been researching intelligent document security solutions, initially focusing on redaction. This line of research involves building better tools to detect sensitive material in documents, especially entities and sensitive relations between entities, determining whether inferences can be made even when sensitive passages have been redacted, and providing efficient encryption techniques to allow content-driven access control.
The detection of sensitive material builds on the same underlying technology described above for enhancing recall and precision. The use of stemming, synonyms and hypernyms, and automatic alias production increases recall, allowing a single search to retrieve entities in many surface forms. The structural normalization provided by the deep processing similarly allows for better relation and context detection. As an additional part of the content discovery for redaction, our current research examines ways to allow collaborative work on the same document collection, so that knowledge-discovery workers can benefit from each other's work and experts can help hone the skills of novices. Another component of the project involves using the Web and other large information collections to determine whether the identity of entities can be detected even when they have been redacted. For example, removing someone's name but leaving their birthdate, sex, and zip code may uniquely identify them, suggesting that further material needs to be redacted.
Once the sensitive text passages have been identified, we provide tools for encrypting document passages and assigning keys so that different users can have access to different types of redacted material. This makes it possible for the document to be viewed in different ways by different people: some may have access to the whole document, some may not be able to see anything related to entity X, and some may only be able to see publicly available material. This encryption capability can either be used actively on the electronic versions of the documents or can be used to prepare specially redacted versions for printing and shipping to different parties.
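A minimal sketch of what passage-level selective encryption could look like, assuming the sensitive spans have already been identified. The symmetric keys (Fernet, from the Python 'cryptography' package) and the role names are illustrative assumptions, not the tools described here.

```python
# A minimal sketch of passage-level selective encryption. Fernet keys stand in
# for whatever key-management scheme a real deployment would use; the role names
# and data layout are hypothetical.
from cryptography.fernet import Fernet

# One key per access group; in practice these would come from a key-management service.
ROLE_KEYS = {
    "full_access": Fernet.generate_key(),
    "entity_x_cleared": Fernet.generate_key(),
}

def encrypt_passage(passage: str, role: str) -> bytes:
    """Encrypt one sensitive passage under the key held by the given access group."""
    return Fernet(ROLE_KEYS[role]).encrypt(passage.encode("utf-8"))

def decrypt_passage(token: bytes, role: str) -> str:
    """A reader holding the matching role key recovers the original passage."""
    return Fernet(ROLE_KEYS[role]).decrypt(token).decode("utf-8")

# A redacted document then becomes a sequence of (public_text, encrypted_passage)
# chunks: viewers without the matching key render a redaction marker in place of
# the ciphertext, while cleared reviewers decrypt and see the original text.
```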
Scalability
As the average number of documents involved in each legal discovery process increases, scalability becomes an important issue for any technology used in the process. The linguistic processing that we advocate here is more computationally intensive than shallower methods such as keyword search or basic regular-expression pattern matching over plain text. To surmount this issue, we use faster processes to go from, for example, 100 million documents to a few million documents; these faster processes may themselves be facilitated by some linguistic processing, e.g., stemming of words so that more matches on basic keyword searches are found. Once the original information collection is reduced to a more manageable load, the slower but more accurate linguistically enhanced processes can be used to prune the collection to a few hundred thousand documents. We have evidence that this deeper linguistic processing will scale to hundreds of thousands of documents, with processing time approaching one second per sentence. Once this initial linguistic processing is done, the resulting indexed documents can be used repeatedly in the applications described above, creating a resource to be shared across the discovery processes.
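The two-stage funnel can be summarized in a few lines of code. The sketch below is only an outline of the control flow; both stage functions are hypothetical placeholders for real components rather than anything described in this paper.

```python
# A minimal sketch of the two-stage funnel: a cheap keyword-level filter culls the
# collection before the expensive per-sentence linguistic analysis runs. Both
# callables are hypothetical placeholders.
from typing import Callable, Iterable, List

def two_stage_discovery(
    documents: Iterable[str],
    fast_filter: Callable[[str], bool],   # e.g., stemmed Boolean keyword match
    deep_index: Callable[[str], None],    # e.g., semantic parsing at ~1 second per sentence
) -> List[str]:
    """Apply the cheap filter to everything; deeply index only the survivors."""
    survivors = [doc for doc in documents if fast_filter(doc)]
    for doc in survivors:
        deep_index(doc)  # the resulting semantic index is reused across later queries
    return survivors
```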
Conclusion
There are a number of benefits from using linguistic processing in e-discovery applications. Linguistic processing can provide fast and flexible characterization of large information collections in pre-trial preparation, as well as enable high-precision search and controlled access to confidential information during discovery. While linguistic processing is more computationally intensive than keyword search, the technology does scale well to large information collections and can also be used in combination with standard search approaches to improve the management and discovery of electronically stored information.
May 2007