Contract Management and Technology - seal software
Whitepaper: Introducing cDiscovery
Why search and eDiscovery are inadequate when trying to
locate contracts in an enterprise
Contents
Background
Contractual Information Formats
Relational and Relevance
Automated Contract Recognition
Information Normalisation
High Risk / Value Clause Detection
cDiscovery Extensions over Search and eDiscovery
OCR and spell checking
Table Recognition and Information Extraction
Signature Detection
Customer Specific Information
Search Methods
Search and eDiscovery limitations
Summary
May 2011
The size of the issue is best illustrated with a simple
example based on the contracting parties of a contract.
Each contracting party could be a person, company, organisation,
country, department or any other combination of addresses and
entities. To effectively search, locate and review contracts based on
contracting parties alone, a user would need to know the parties
before starting the search, filter out false positives that average
more than 80% of the results, and then open and review each
remaining item for the correct data.
The false positive rate can be this high because of how search
works. Take for example a contracting party of Sony Music.
Searching for this alone would produce a result set consisting of
every item where Sony Music is mentioned. The search is not able to
differentiate between a simple document referring to a music track
held by Sony Music and a contract where Sony Music is one of the
contracting parties.
A simple test of this is to enter the search “Sony music” + “contracts”
into Google. This returns over 1.2 million hits, with very few, if
any, being actual contracts. If we refine the search further to
“Sony music” + “contracting party”, the results drop to just over
1,200 items. However, none of those items is an actual contract where
Sony Music is the contracting party.
While a Google search of the internet is not a direct representation
of internal customer environments, the challenges remain the same.
Many organisations already use an internal search engine or an
eDiscovery tool, but these do not provide the functionality required
to meet the challenge of locating contractual information.
This simple illustration shows that to effectively search, discover
and manage contracts a new approach is required, targeted at
the information held within contracts. With this in mind, a new
technology and methodology is needed.
At this point in the whitepaper we introduce a new technology
called cDiscovery. This term will be used to refer to the discovery of
contracts throughout the rest of this document. Currently the only
cDiscovery solution on the market, and the basis for reference within
this document, is the Seal Software cDiscovery solution.
Contractual Information Formats
To further understand cDiscovery’s importance, it is necessary to
consider contractual formats and layouts. Contracts in many cases
can be free form information items, with dates, parties, clauses and
obligations randomly distributed within the documents.
With the majority of contracts being in image formats, Optical
Character Recognition (OCR) is required to extract information. During
this extraction process, poor image quality can introduce errors.
For example, an “i” can become an “!”, which turns “client” into
“cl!ent”.
Further formats, such as images embedded within PDF files, also
require specific handling and processing. For example, when
processing a PDF file, does the system use a PDF IFilter or equivalent
to read its text, or does it pass the file through an OCR engine
because it contains embedded images?
Not only must the actual document formats be considered, the
layouts within the documents must also be recognised. Take for
example a contract with a table detailing the contracting party,
contract values and jurisdiction. The relation between the headings
and cells needs to be understood to enable effective processing and
discovery. Simple
text extraction is not capable of producing a relational view or data
correlation between cells.
Relational and Relevance
One of the main advantages of cDiscovery over Search and
eDiscovery is its ability to determine the relational mapping and
relevance between items of information within the contract. To
illustrate this point, a contract or document can contain a location,
say New York. To determine the relevance and importance of this
location, the system must process the preceding and following terms,
words, phrases and sentences. Thus, when processing the location,
the system must first discover it, investigate its context and
then extract the information if required.
The process of identifying a contract’s jurisdiction is a good example
of relevance. The actual contract might have many differing locations,
countries or states listed within it. Thus, to determine the
jurisdiction, terms such as Governing Law and Applicable Law all need
to be accounted for. This can only be done when the relevance and
relational positioning of the relevant terms are understood.
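The difference between listing locations and determining a jurisdiction can be sketched in a few lines of Python. This is a minimal illustration only, not the Seal Software implementation: the location list, the cue phrases and the sentence-splitting rule are all assumptions made for the example.

```python
import re

# Hypothetical gazetteer and cue phrases, chosen only for this illustration.
LOCATIONS = ["New York", "Delaware", "England"]
JURISDICTION_CUES = ("governed by", "governing law", "applicable law", "jurisdiction")


def all_locations(text):
    """Plain search view: every known location mentioned anywhere in the text."""
    lowered = text.lower()
    return [loc for loc in LOCATIONS if loc.lower() in lowered]


def find_jurisdiction(text):
    """Return the first known location that shares a sentence with a cue phrase."""
    # Naive sentence split on '.' or ';' followed by whitespace (an assumption).
    for sentence in re.split(r"(?<=[.;])\s+", text):
        lowered = sentence.lower()
        if any(cue in lowered for cue in JURISDICTION_CUES):
            for loc in LOCATIONS:
                if loc.lower() in lowered:
                    return loc
    return None


SAMPLE = (
    "The Supplier has offices in New York and Delaware. "
    "This Agreement shall be governed by the laws of England."
)
print(all_locations(SAMPLE))      # every location mentioned
print(find_jurisdiction(SAMPLE))  # only the location tied to a governing-law cue
```

A location filter alone reports all three places, while the context-aware function reports only the one that appears alongside a governing-law phrase.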
While standard search engines are capable of determining locations
and presenting filters based on them, they don’t present the user
with information targeted at the relevant and relational level. In many
cases only the location is accounted for.
Further illustration of relevance and relational information can be
applied to the first example. Let’s take cDiscovery as the discovery
engine instead of a Search or eDiscovery engine. The search results
now return only items where “Sony Music” is actually listed as a
contracting party, thus reducing the amount of “noise” and the number
of false positive results.
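As a rough sketch of this contrast, the Python below compares a plain keyword match with a match restricted to the contracting-party role. The two sample documents, the name Example Records Ltd and the “between A and B” pattern are hypothetical, and a real system would need far more robust party extraction.

```python
import re

# Two made-up documents: one merely mentions the name, one is a contract.
DOCUMENTS = {
    "review.txt": "The new track released by Sony Music entered the charts this week.",
    "licence.txt": "This Agreement is made between Sony Music and Example Records Ltd.",
}

# Assumed convention: parties appear as "between <A> and <B>" in the preamble.
PARTY_PATTERN = re.compile(r"\bbetween\s+(.+?)\s+and\s+(.+?)[.,;]", re.IGNORECASE)


def keyword_hits(name, docs):
    """Search-engine style: any document that mentions the name."""
    return sorted(d for d, text in docs.items() if name.lower() in text.lower())


def party_hits(name, docs):
    """Role-aware style: only documents where the name is a contracting party."""
    hits = []
    for doc_id, text in docs.items():
        for match in PARTY_PATTERN.finditer(text):
            parties = [p.strip().lower() for p in match.groups()]
            if name.lower() in parties:
                hits.append(doc_id)
                break
    return sorted(hits)


print(keyword_hits("Sony Music", DOCUMENTS))  # both documents
print(party_hits("Sony Music", DOCUMENTS))    # only the contract
```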
Automated Contract Recognition
Even with the relevance and relational awareness detailed above,
further methods are needed to detect contracts and provide users
with a simple proactive view of the contractual information.
One area where this is important is within the actual recognition
of the contract type. cDiscovery solutions present a graded level of
confidence on items classified as contracts. This is important because
false positives will still occur, although in greatly reduced numbers.
To present a graded confidence level on discovered contracts,
the system needs to extract the “type” of contract discovered. To
effectively do this, not only are the relational and relevance methods
needed, but the dynamic building of contract types is also required.
Take a simple example: a Non-Disclosure Agreement could be listed
within a contract as Non-Disclosure, Non-Disclosure Agreement, NDA,
Mutual NDA etc. There are many differing combinations
for the same contract type. Thus, to ensure that the correct contract
type is applied, contract types need to be built dynamically from
wording, phrases and relational information. This
is a significant benefit of the cDiscovery methodology and application
over standard search and eDiscovery methods.
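The mapping of many surface forms to one contract type can be sketched as follows. Note the simplification: a fixed, hand-written pattern table stands in for the dynamic building of contract types described above, and the patterns and the second contract type are assumptions for illustration.

```python
import re

# Hypothetical, hand-written variant table. A real system would build and
# extend this dynamically rather than rely on a fixed list.
CONTRACT_TYPE_VARIANTS = {
    "Non-Disclosure Agreement": [
        r"\bnda\b",                                 # NDA, Mutual NDA
        r"\bnon[-\s]disclosure(\s+agreement)?\b",   # Non-Disclosure (Agreement)
    ],
    "Sales Contract": [
        r"\bsales\s+(contract|agreement)\b",
    ],
}


def detect_contract_type(text):
    """Return the canonical contract type for the first matching variant."""
    for canonical, patterns in CONTRACT_TYPE_VARIANTS.items():
        for pattern in patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return canonical
    return None


for title in ("Mutual NDA", "Non-Disclosure", "NON DISCLOSURE AGREEMENT", "Cover letter"):
    print(title, "->", detect_contract_type(title))
```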
Once a contract type has been identified, the confidence level
that the item is actually a contract is at its highest. Some
items will be extracted with no contract type but still contain
relevant contractual information. These are given the next highest
relevance scores. That leaves items that contain only some contractual
matches, such as a cover letter for a contract with
details on start and termination notices.
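The grading described above could be expressed along the following lines. The tier names and the threshold of three extracted fields are assumptions made purely for this sketch; the whitepaper does not state how the tiers are scored.

```python
def grade_item(contract_type, extracted_fields):
    """Assign a confidence tier to a discovered item.

    contract_type:    canonical type name, or None if no type was recognised
    extracted_fields: dict of contractual data found in the item
    """
    if contract_type:
        return "high"      # a recognised contract type gives the top tier
    if len(extracted_fields) >= 3:  # hypothetical threshold
        return "medium"    # no type, but substantial contractual information
    if extracted_fields:
        return "low"       # only some contractual matches, e.g. a cover letter
    return "none"


print(grade_item("Non-Disclosure Agreement", {"parties": ["A", "B"]}))
print(grade_item(None, {"parties": ["A", "B"], "start": "2010-01-06", "law": "England"}))
print(grade_item(None, {"termination_notice": "30 days"}))
```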
Information Normalisation
One area commonly overlooked and misunderstood within a search
process is information normalisation. Information normalisation is the
process of automatically determining the correct value when ambiguity
exists. This is most often seen when processing dates.
Date formats can be US English, UK English, European and many
others, with short, long and textual dates being used. An example
of this is 01/06/10. Based on US and UK formats alone, this date can
be the 6th of January 2010 or the 1st of June 2010. Word-based dates,
such as “the First day of January 2010”, add further variants. It is
clear that normalisation needs to take place.
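The ambiguity of 01/06/10 can be demonstrated with Python’s standard datetime module. This sketch only enumerates the candidate readings; choosing between them would need the contextual information discussed in this whitepaper, and the two format labels are assumptions for the example.

```python
from datetime import datetime

# Candidate short-date layouts considered in this illustration only.
FORMATS = {
    "US (month/day/year)": "%m/%d/%y",
    "UK (day/month/year)": "%d/%m/%y",
}


def candidate_dates(raw):
    """Return every valid reading of a short date as an ISO 8601 string."""
    readings = {}
    for label, fmt in FORMATS.items():
        try:
            readings[label] = datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            pass  # this layout cannot produce a valid date
    return readings


print(candidate_dates("01/06/10"))  # two valid readings: ambiguous
print(candidate_dates("25/12/10"))  # one valid reading: unambiguous
```

When only one reading survives, the value can be normalised directly. When two survive, the system needs further evidence, such as the locale of the contracting parties or other dates in the same document.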
The Normalisation process covers not only dates; it also covers
locations, people and companies, where short names or abbreviations
are used.
The cDiscovery process, unlike search engines, needs to understand
the relevance and context of dates and formats within contracts. It
also needs to normalise the information. Without normalisation you
could miss a renewal or termination date by almost five months,
referring to our example above.
This process of understanding the locale and relevance of the
information is a key differentiator between cDiscovery and Search or
eDiscovery methods.
cDiscovery Extensions over Search and eDiscovery
The preceding sections have referred to cDiscovery functionality and
how it differs from that of a standard search solution. To further
understand the extensions provided over and above search and
eDiscovery, some key functional extensions are described below.
OCR and spell checking
As many, if not all, contracts are held as, or contain embedded,
image files (TIFF, GIF, PDF etc.), OCR processing and information
capture need to be performed. During this process, the quality of the
scanned images can introduce noise and errors within the text. The
errors introduced could, if left unmanaged, cause contracts to be
missed during the discovery phase.
To counter and, where possible, eliminate errors of this nature, spell
checking and intelligent processing need to be performed. Intelligent
processing of spelling mistakes is where the application again looks
to the surrounding wording and phrases to determine the best
contextual and relevant replacement for an incorrectly spelled word.
While some search engines do provide spelling suggestions and
corrections, these are based primarily on the Levenshtein distance or
dictionary-based lookups of common words. While this might work
for searching and eDiscovery methods alone, it does not provide the
required relevance and proximity calculations. This method also relies
on correcting users’ typing errors at search time, rather than on
errors being corrected at the source of extraction.
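For reference, the Levenshtein distance mentioned above counts the single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below implements it and applies it to extracted text rather than to a user’s query. It is dictionary-only: the contextual scoring described in this section is not modelled, and the vocabulary and distance limit are assumptions for the example.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical vocabulary of contract terms.
VOCABULARY = {"client", "contract", "termination", "renewal"}


def correct_token(token, vocabulary=VOCABULARY, max_distance=1):
    """Replace an OCR-damaged token with the nearest vocabulary word, if close."""
    lowered = token.lower()
    if lowered in vocabulary:
        return token
    # Ties are broken alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda word: (levenshtein(lowered, word), word))
    return best if levenshtein(lowered, best) <= max_distance else token


print(correct_token("cl!ent"))
print(correct_token("zzz"))
```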
Table Recognition and Information Extraction
With eDiscovery and Search being targeted at finding as much
information as possible and leaving the processing to the users,
formatting and tabular data are not processed in context. While this is
acceptable for search and eDiscovery tasks, when dealing with
relational contractual data, tabular information must be accounted for.
Take for example a pricing structure that is based on a table, with
dates, items and values, including a total contractual value, within the
cells. In most if not all cases the eDiscovery and search engines will
extract all the headings followed by all the data as a single stream of
text. This removes any relationship between the headings, cells
and columns, making it impossible to determine the context and
relevance of the information.
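The loss of structure can be shown with a small Python sketch. The pricing table below is invented for illustration; the point is only the contrast between a flattened text stream and rows that keep each cell linked to its heading.

```python
# Invented pricing table used only for this illustration.
HEADINGS = ["Date", "Item", "Value"]
ROWS = [
    ["01 Jan 2010", "Licence fee", "10,000"],
    ["01 Jun 2010", "Support", "2,500"],
]


def flatten(headings, rows):
    """Plain text extraction: headings, then all cells, as one stream."""
    return " ".join(headings + [cell for row in rows for cell in row])


def relational(headings, rows):
    """Structure-preserving extraction: each cell stays tied to its heading."""
    return [dict(zip(headings, row)) for row in rows]


print(flatten(HEADINGS, ROWS))
print(relational(HEADINGS, ROWS)[1]["Value"])  # value of the second row
```

In the flattened stream there is nothing to say which value belongs to which item or date, whereas the relational form can answer that question directly.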
Because cDiscovery relies on being able to process information within
context, it is imperative that it maintains the linkage between items.
Therefore, it must be able to process tabular information
within tables, which can become challenging when dealing with
image-based items.
Signature Detection
To further reduce the possibility of false positive results and to target
signed contracts, the capability to detect a possible signature within
the contract should be available. With this detection, users can be
presented with a targeted set of items that have a very high
confidence level of being entered-into contracts.
Combining this with the extracted information, such as termination and
renewal dates or notice periods, a risk and value matrix can be quickly
determined.
Customer Specific Information
A further extension that the cDiscovery solution provides is the ability
of the application to “learn” about the environment it has been
installed into. In much the same way as a child learns by examples and
reference information, the cDiscovery solution should be able to learn
as well. It should not only be able to recognise, for example, a simple
list of companies or people that are known to the organisation;
it should also be able to quickly incorporate this information
into the processing algorithms to improve accuracy and extraction of
relevant information.
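One simple way to picture the use of a customer-supplied list is an alias table that maps the names and abbreviations known to the organisation onto a single canonical party. The sketch below assumes such a table; the company names and aliases are hypothetical, and this lookup is far simpler than the learning behaviour described above.

```python
import re

# Hypothetical customer-supplied list: alias -> canonical party name.
KNOWN_PARTIES = {
    "Example Records Ltd": "Example Records Ltd",
    "ERL": "Example Records Ltd",
    "Acme Holdings": "Acme Holdings plc",
}


def tag_known_parties(text, aliases=KNOWN_PARTIES):
    """Return the canonical names of all known parties mentioned in the text."""
    found = set()
    for alias, canonical in aliases.items():
        # Whole-word match so that short aliases do not match inside other words.
        if re.search(r"\b" + re.escape(alias) + r"\b", text, re.IGNORECASE):
            found.add(canonical)
    return sorted(found)


print(tag_known_parties("This licence is granted by ERL to Acme Holdings."))
```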
Search Methods
As with eDiscovery, cDiscovery requires a search engine to process
and present information to users. This allows users to search
based on the full text information within the contracts or on the
proactively extracted contextual information.
Because of the proactive extraction of information, cDiscovery
solutions can present users with information without the users
knowing what they are looking for. An example of this type of
information management and presentation is the contracting party
and contract type.
Take the first example of “Sony Music”. This time, however, the user is
presented with a view that groups all contracts by
type, say Intellectual Property Sale and Sales Contracts. At this point
the user only needs to select the view, or faceted search view, to see
only the contracts of that type. Add to this the ability to search
for a contracting party of Sony Music, and the system presents the
user with accurate and targeted results, with the ability to view all the
extracted information within a single view.
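A faceted view of this kind amounts to grouping the extracted records by one field and filtering by another. The Python sketch below shows the idea on three invented contract records; the record layout and the party names other than Sony Music are assumptions for illustration.

```python
from collections import defaultdict

# Invented records standing in for proactively extracted contract data.
CONTRACTS = [
    {"id": "c1", "type": "Intellectual Property Sale",
     "parties": ["Sony Music", "Example Records Ltd"]},
    {"id": "c2", "type": "Sales Contract",
     "parties": ["Sony Music", "Acme Holdings plc"]},
    {"id": "c3", "type": "Sales Contract",
     "parties": ["Acme Holdings plc", "Example Records Ltd"]},
]


def facet_by(records, field):
    """Group record ids by the value of one extracted field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record["id"])
    return dict(groups)


def filter_by_party(records, party):
    """Keep only records where the given name is a contracting party."""
    return [record for record in records if party in record["parties"]]


print(facet_by(CONTRACTS, "type"))
print(facet_by(filter_by_party(CONTRACTS, "Sony Music"), "type"))
```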
Search and eDiscovery limitations
The main challenges faced by Search and eDiscovery methods today
are lost metadata and formatting when documents are converted to
image type files. Most, if not all, entered-into contracts are image files,
with historical data almost always being faxed versions of the original
signed contract.
As has been previously detailed, the loss of formatting and metadata
causes the applications to extract only streams of text, and even that
depends on whether an OCR process has been used. Even with an OCR
process, little or no error correction or information correlation is
performed by the eDiscovery and Search engines, so errors remain
within the extracted text.