Whitepaper Introducing C Discovery

Introducing cDiscovery
Why search and eDiscovery are inadequate when trying to
locate contracts in an enterprise

Contents
Background
Contractual Information Formats
Relational and Relevance
Automated Contract Recognition
Information Normalisation
High Risk / Value Clause Detection
cDiscovery Extensions over Search and eDiscovery
OCR and spell checking
Table Recognition and Information Extraction
Signature Detection
Customer Specific Information
Search Methods
Search and eDiscovery limitations
Summary

May 2011

Background
Today’s enterprises hold many different document types across many
different information sources. Contracts are one such information
type, and they exist in a range of document formats, including
scanned images, office files and PDFs.

Historically, contracts have been held on file shares as TIFF or
image-embedded PDF files, originating from scanners or email
attachments, with limited metadata attached to them. Because
contracts need to be signed by both parties, the final contract would
be held either as the original paper document or, in most cases, as a
copy faxed back to the contracting parties.

Within each business unit and geographical location, various contract
management policies could have been deployed. This can create a
dispersed and highly irregular contract management environment.

Adding to the complexity, each location or department could have
deployed and used many differing contract templates and have
received hundreds of various inbound contract formats.

As many of the formats, layouts and information will be unknown,
searching for information becomes an arduous task. With standard
eDiscovery or search methods, for example, each additional
contracting party increases the number of combinations and “false
positives”.

A false positive is defined as “relating to or being an individual or a
test result that is erroneously classified in a positive category”.

Relating this directly back to search, you are given a result that
matches your query but is not a correct match.

A common misunderstanding regarding eDiscovery is that it provides
rich contextual information. Some eDiscovery solutions use digital
fingerprints (NIST lists) to remove non-relevant data and improve
the relevancy of result sets. This approach, however, is the direct
inverse of the cDiscovery process, which classifies information IN.
Additionally, eDiscovery uses search within its processing, with the
same functional limitations relating to contractual information.

The size of the issue is best illustrated with a simple example based
on the contracting parties of a contract.

Each contracting party could be a person, company, organisation,
country, department or any other combination of addresses and
entities. To effectively search, locate and review contracts based on
contracting parties alone, a user would need to know the parties
before starting the search, filter out the false positives (on
average 80% or more of the results), and then open and review each
remaining item for the correct data.

The false positive rate can be this high because of how search
works. Take for example a contracting party of Sony Music.
Searching for this alone would produce a result set consisting of
every item where Sony Music is mentioned. The search is not able to
differentiate between a simple document referring to a music track
held by Sony Music and a contract where Sony Music is one of the
contracting parties.

A simple test of this is to place the search “Sony music” +
“contracts” into Google. This returns over 1.2 million hits, with very
few, if any, being actual contracts. If we further refine the search
to “Sony music” + “contracting party”, the results number just over
1,200 items. However, none of those items are actual contracts where
Sony Music is the contracting party.

While a Google search of the internet is not a direct representation
of internal customer environments, the challenges remain the same.
Many organisations already use an internal search engine or
eDiscovery, but these do not provide the functionality required to
meet the challenge of locating contractual information.

This simple illustration shows that to effectively search, discover
and manage contracts a new approach is required, targeted at
the information held within contracts. With this in mind, a new
technology and methodology is needed.

At this point in the whitepaper we introduce a new technology
called cDiscovery. This term will be used to refer to the discovery of
contracts throughout the rest of this document. Currently the only
cDiscovery solution on the market, and the basis for reference within
this document, is the Seal Software cDiscovery solution.

Contractual Information Formats

To further understand cDiscovery’s importance, it is necessary to
consider contractual formats and layouts. Contracts in many cases
can be free form information items, with dates, parties, clauses and
obligations randomly distributed within the documents.

With the majority of contracts being in image formats, Optical
Character Recognition (OCR) is required to extract information. During
this extraction process, poor image quality can introduce errors. For
example, an “i” becomes an “!”, which causes “client” to become
“cl!ent”.

Other formats, such as images embedded within PDF files, also
require specific handling and processing. For example, when
processing PDF files, does the system use a PDF IFilter or equivalent,
or does it process the items via an OCR engine because they contain
embedded images?

Not only must the document formats be considered; the layouts
within the documents must also be recognised. Take, for example, a
contract with a table detailing the contract party, contract values
and jurisdiction. The relationship between the headings and cells
needs to be understood to enable effective processing and discovery.
Simple text extraction is not capable of producing a relational view
or data correlation between cells.

Relational and Relevance

One of the main advantages of cDiscovery over Search and
eDiscovery is its ability to determine the relational mapping and
relevance between items of information within the contract. To
illustrate this point, a contract or document can contain a location,
say New York. To determine the relevance and importance of this
location, the system must process the preceding and following terms,
words, phrases and sentences. Thus, when processing the location, the
system must first discover the location, then investigate its context
and finally extract the information if required.

The process of identifying the contract’s jurisdiction is a good
example of relevance. The contract might have many differing
locations, countries or states listed within it. Thus, to determine
the jurisdiction, terms such as Governing Law and Applicable Law all
need to be accounted for. This can only be done when the relevance
and relational positioning of the relevant terms are understood.
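
As an illustration only, the idea of checking the context around a
location can be sketched as follows. The cue phrases, the location
list and the 60-character window are assumptions made for this
example and are not taken from any cDiscovery product.

```python
import re

# Hypothetical cue phrases and locations, chosen for illustration only.
CUES = ("governed by", "governing law", "applicable law", "jurisdiction of")
LOCATIONS = ("New York", "California", "England and Wales")


def find_jurisdiction(text, window=60):
    """Return locations preceded, within `window` characters, by a cue phrase."""
    lowered = text.lower()
    hits = []
    for location in LOCATIONS:
        for match in re.finditer(re.escape(location.lower()), lowered):
            # Look only at the text immediately before the location.
            context = lowered[max(0, match.start() - window):match.start()]
            if any(cue in context for cue in CUES) and location not in hits:
                hits.append(location)
    return hits


sample = (
    "The Supplier has offices in California. "
    "This Agreement shall be governed by the laws of the State of New York."
)
print(find_jurisdiction(sample))  # ['New York']
```

Both locations appear in the sample text, but only New York follows a
governing-law cue, so only New York is reported.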

While standard search engines are capable of determining locations
and presenting filters based on them, they don’t present the user
with information targeted at the relevant and relational level. In many
cases only the location is accounted for.

The first example further illustrates relevance and relational
information. Let’s take cDiscovery as the discovery engine instead of
a Search or eDiscovery engine. The search results now return only
items where “Sony Music” is actually listed as a contracting party,
thus reducing the amount of “noise” and the number of false positive
results.
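
The difference between a keyword match and a contracting-party match
can be sketched as below. The preamble pattern, the file names and
“Example Ltd” are hypothetical, and real contracts word their
preambles in far more varied ways than this single pattern assumes.

```python
import re

# Assumed preamble wording: "between <Party A> and <Party B>".
PREAMBLE = re.compile(r"\bbetween\s+(.+?)\s+and\s+(.+?)[,.(]", re.IGNORECASE)


def contracting_parties(text):
    """Return the two parties named in the preamble, or an empty list."""
    match = PREAMBLE.search(text)
    return [party.strip() for party in match.groups()] if match else []


docs = {
    "contract.txt": "This Agreement is made between Sony Music and Example Ltd, on 1 June 2010.",
    "memo.txt": "The track is held by Sony Music and was released in 2009.",
}

# Keyword search: every document that mentions the name.
keyword_hits = [name for name, text in docs.items() if "sony music" in text.lower()]
# Party search: only documents where the name is a contracting party.
party_hits = [name for name, text in docs.items()
              if "Sony Music" in contracting_parties(text)]

print(keyword_hits)  # ['contract.txt', 'memo.txt']
print(party_hits)    # ['contract.txt']
```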

Automated Contract Recognition
Even with the relevance and relational awareness detailed above,
further methods are needed to detect contracts and provide users
with a simple proactive view of the contractual information.

One area where this is important is the recognition of the contract
type. cDiscovery solutions present a graded level of confidence on
items classified as contracts. This is important because false
positives, though greatly reduced, will still occur.

To present a graded confidence level on discovered contracts,
the system needs to extract the “type” of contract discovered. To
do this effectively, the relational and relevance methods are needed,
together with the dynamic building of contract types.

Take a simple example: a Non-Disclosure Agreement could be listed
within a contract as Non-Disclosure, Non-Disclosure Agreement, NDA,
Mutual NDA, etc. There are many differing combinations for the same
contract type. Thus, to ensure that the correct contract type is
applied, dynamic building of the contract types based on wording,
phrases and relational information needs to be applied. This is a
significant benefit of the cDiscovery methodology and application
over standard search and eDiscovery methods.
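
A minimal sketch of mapping such variants to one canonical contract
type is shown below. It uses a fixed, hand-written pattern table as a
stand-in for the dynamically built types described above. The
patterns and the second contract type are assumptions for
illustration.

```python
import re

# Hypothetical pattern table: canonical type -> pattern covering its variants.
TYPE_PATTERNS = {
    "Non-Disclosure Agreement": re.compile(
        r"\b(?:mutual\s+)?(?:NDA|non[-\s]?disclosure(?:\s+agreement)?)\b",
        re.IGNORECASE,
    ),
    "Master Services Agreement": re.compile(
        r"\b(?:MSA|master\s+services?\s+agreement)\b", re.IGNORECASE
    ),
}


def contract_type(title):
    """Return the canonical contract type for a title, or None if unknown."""
    for canonical, pattern in TYPE_PATTERNS.items():
        if pattern.search(title):
            return canonical
    return None


for variant in ("Non-Disclosure", "Non-Disclosure Agreement", "NDA", "Mutual NDA"):
    print(variant, "->", contract_type(variant))
```

All four variants resolve to the same canonical type, while a title
that matches no pattern returns None.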

Once a contract type has been identified, the confidence that the
item is actually a contract is at its highest level. Some contracts
will be extracted with no contract type but still contain relevant
contractual information. These are given the next highest relevance
scores. This leaves items that contain only some contractual matches,
such as a cover letter for a contract with details on start and
termination notices.
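
The three grades described above can be sketched as a simple tiering
function. The tier names and the threshold of three matched terms are
assumptions made for this example; the source does not state how the
grades are calculated.

```python
def confidence_tier(has_contract_type, contractual_matches):
    """Grade an item by how strongly it resembles a contract.

    has_contract_type:   True if a contract type was identified.
    contractual_matches: number of contractual terms found in the item.
    """
    if has_contract_type:
        return "high"
    if contractual_matches >= 3:  # assumed threshold, for illustration
        return "medium"
    if contractual_matches > 0:
        return "low"
    return "none"


print(confidence_tier(True, 5))   # high: contract type identified
print(confidence_tier(False, 4))  # medium: no type, several contractual terms
print(confidence_tier(False, 1))  # low: e.g. a cover letter
```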

Information Normalisation

One area commonly overlooked and misunderstood within a search
process is information normalisation. Information normalisation is the
process of automatically determining the correct value when ambiguity
exists. This can be most often seen when processing dates.

Date formats can be US English, UK English, European and many
others, with short, long and textual dates being used. An example
of this is 01/06/10. Based on US and UK formats alone, this date can
be the 6th of January 2010 or the 1st of June 2010. Textual dates add
further variants, such as “the First day of January 2010”. It is
clear that normalisation needs to take place.
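
The ambiguity can be demonstrated with Python’s standard datetime
module. This is a sketch of the problem only: the two-entry format
table is an assumption, and deciding which reading is correct still
needs the contextual processing described in this paper.

```python
from datetime import datetime

# Assumed format table covering only the US and UK short forms.
FORMATS = {"US": "%m/%d/%y", "UK": "%d/%m/%y"}


def interpretations(raw):
    """Return every valid reading of a short date as ISO 8601 strings."""
    readings = {}
    for locale, fmt in FORMATS.items():
        try:
            readings[locale] = datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            pass  # not a valid date under this format
    return readings


print(interpretations("01/06/10"))  # {'US': '2010-01-06', 'UK': '2010-06-01'}
print(interpretations("25/12/10"))  # {'UK': '2010-12-25'}
```

The first date has two valid readings and therefore needs context to
resolve, whereas the second is valid only under the UK format.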

The Normalisation process covers not only dates; it also covers
locations, people and companies, where short names or abbreviations
are used.

The cDiscovery process, unlike search engines, needs to understand
the relevance and context of dates and formats within contracts. It
also needs to normalise the information. Without normalisation you
could miss a renewal or termination date by almost five months,
referring to our example above.

This process of understanding the locale and relevance of the
information is a key differentiator between cDiscovery and Search or
eDiscovery methods.

High Risk / Value Clause Detection

Another benefit of cDiscovery is its ability to identify contractual
clauses or wordings that present risk or value to the organisation.
One such example is the “Assignment” clause within many contracts.
The contracting parties either have, or do not have, the right to
assign the contract during a sale, merger or outsourcing event.

Recognition of the risk within the clause is also extended to
understanding dates and relative time periods. Take for example a
conditional assignment of a contract, where the main contracting
party is given the right to assign but must first provide 28 days written
notice to all parties.

Again the detection of relevance, proximity, durations and the
normalisation of values is required. Thus to understand the inherent
risk or value of a clause or body of text, the cDiscovery solution must
correlate multiple values.
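
As a sketch of correlating a duration with a date, the example below
extracts a notice period from an assignment clause and works out the
latest date on which notice could be given. The clause wording, the
pattern and the event date are hypothetical.

```python
import re
from datetime import date, timedelta

# Assumed phrasing: "<N> days written notice" (optionally "prior written notice").
NOTICE = re.compile(r"(\d+)\s+days['’]?\s+(?:prior\s+)?written\s+notice", re.IGNORECASE)


def notice_days(clause):
    """Return the notice period in days, or None if none is stated."""
    match = NOTICE.search(clause)
    return int(match.group(1)) if match else None


def latest_notice_date(event_date, clause):
    """Return the last date on which notice can be given before the event."""
    days = notice_days(clause)
    return event_date - timedelta(days=days) if days is not None else None


clause = ("The Customer may assign this Agreement on providing "
          "28 days written notice to all parties.")
print(notice_days(clause))                             # 28
print(latest_notice_date(date(2011, 6, 1), clause))    # 2011-05-04
```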

The ability to quickly extend and tailor the detection and extraction
of key contextual metadata is also a critical aspect of the cDiscovery
process. An item of value to one company can be seen as high risk to
another. Thus a cDiscovery solution must have the ability to “learn” so
that it can be tailored to customers’ needs based on “teaching”. This
iterative process improves and refines cDiscovery’s overall accuracy
and precision.

Search and standard eDiscovery methods are not well positioned
to provide this level of correlation. They are designed to provide
fast access to result sets over millions of documents, but leave the
correlation and understanding to the user.

cDiscovery Extensions over Search and eDiscovery
The preceding sections have referred to cDiscovery functionality and
how it differs from a standard search solution. To further understand
what cDiscovery provides over and above search and eDiscovery, some
key functional extensions are described below.

OCR and spell checking

Many, if not all, contracts are held as image files (TIFF, GIF, PDF
etc.) or contain embedded images. OCR processing and information
capture therefore need to be performed. During this process, the
quality of the scanned images can introduce noise and errors within
the text. The errors introduced could, if left unmanaged, cause
contracts to be missed during the discovery phase.

To counter and, where possible, eliminate errors of this nature,
spell checking and intelligent processing need to be performed.
Intelligent processing of spelling mistakes is where the application
again looks to the surrounding wording and phrases to determine the
best contextual and relevant replacement for an incorrectly spelled
word.
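
A much simplified sketch of correcting OCR errors at the source is
shown below. It tries known character confusions against a small
vocabulary. The confusion table and vocabulary are assumptions, and
the contextual ranking of competing candidates described above is not
shown.

```python
from itertools import product

# Hypothetical OCR confusion table: misread character -> likely originals.
CONFUSIONS = {"!": "il1", "0": "o", "1": "li", "5": "s"}
# Hypothetical vocabulary of expected contractual terms.
VOCAB = {"client", "contract", "clause", "notice"}


def correct_token(token):
    """Return a vocabulary word reachable via known confusions, else the token."""
    lowered = token.lower()
    if lowered in VOCAB:
        return lowered
    # For each character, allow the character itself or any likely original.
    options = [char + CONFUSIONS.get(char, "") for char in lowered]
    for candidate in map("".join, product(*options)):
        if candidate in VOCAB:
            return candidate
    return token  # leave unknown tokens untouched


print(correct_token("cl!ent"))    # client
print(correct_token("c0ntract"))  # contract
print(correct_token("Sony"))      # Sony
```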

While some search engines do provide spelling suggestions and
corrections, these are based primarily on the Levenshtein distance or
dictionary-based lookups of common words. While this might work
for searching and eDiscovery methods alone, it does not provide the
required relevance and proximity calculations. This method also relies
on correcting users’ typing errors at query time, rather than on
errors being corrected at the source of extraction.
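
For reference, the Levenshtein distance counts the single-character
insertions, deletions and substitutions needed to turn one string into
another. The standard dynamic programming form is sketched below. The
example words are chosen for illustration.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


print(levenshtein("cl!ent", "client"))  # 1
print(levenshtein("c1ause", "clause"))  # 1
print(levenshtein("c1ause", "cause"))   # 1
```

The OCR output “c1ause” is one edit away from both “clause” and
“cause”, so distance alone cannot choose between them. This is where
the surrounding wording has to be taken into account.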

Table Recognition and Information Extraction

With eDiscovery and Search being targeted at finding as much
information as possible and leaving the processing to the users,
formatting and tabular data are not processed in context. While this
is acceptable for search and eDiscovery tasks, tabular information
must be accounted for when dealing with relational contractual data.

Take for example a pricing structure that is based on a table, with
dates, items and values, including a total contractual value, within
the cells. In most if not all cases the eDiscovery and search engines
will extract all the headings followed by all the data as a single
stream of text. This totally removes any relationship between the
headings, cells and columns, thus making it impossible to determine
the context and relevance of the information.
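
The loss of structure can be sketched with a small pricing table. The
headings and values below are invented for the example.

```python
# Hypothetical pricing table as it appears in a contract.
headers = ["Date", "Item", "Value"]
rows = [
    ["01 Jan 2010", "Licence", "1000"],
    ["01 Jun 2010", "Support", "250"],
]

# Plain text extraction: headings followed by cells as one stream of text.
flattened = " ".join(headers + [cell for row in rows for cell in row])

# Relational extraction: each cell stays linked to its heading.
records = [dict(zip(headers, row)) for row in rows]
total_value = sum(int(record["Value"]) for record in records)

print(flattened)
print(records[1]["Item"], records[1]["Value"])  # Support 250
print(total_value)                              # 1250
```

In the flattened stream nothing indicates that “250” is the value of
“Support”. In the relational form that link is retained, so a total
contractual value can be calculated.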

Because cDiscovery relies on being able to process information within
context, it is imperative that it maintains the linkage between items.
It must therefore be able to process tabular information within
tables, which can become challenging when dealing with image-based
items.

Signature Detection

To further reduce the possibility of false positive results and to target
signed contracts, the capability to detect a possible signature within
the contract should be available. With this detection, users can be
presented with a targeted set of items that have a very high
confidence level of being executed contracts.

Combining this with the extracted information, termination and
renewal dates or notice periods, a risk and value matrix can be quickly
determined.

Customer Specific Information

A further extension that the cDiscovery solution provides is the ability
to allow the application to “learn” about the environment it has been
installed into. In much the same way as a child learns by examples and
reference information, the cDiscovery solution should be able to learn
as well. It should not only be able to recognise, for example, a simple
list of companies or people that are known to the organisation;
it should also be able to quickly incorporate this information
into the processing algorithms to improve accuracy and extraction of
relevant information.
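
One simple way to use a customer-supplied list is sketched below,
where known names and their aliases are compiled into a matcher that
returns the canonical name. “Example Holdings Limited” and its
aliases are hypothetical, and this lookup is only one possible way of
incorporating such a list.

```python
import re


def build_matcher(known_parties):
    """Build a function that finds known parties (by name or alias) in text.

    known_parties maps a canonical name to a list of aliases.
    """
    alias_to_name = {}
    for name, aliases in known_parties.items():
        for alias in [name, *aliases]:
            alias_to_name[alias.lower()] = name
    # Longest aliases first, so the most specific match wins.
    ordered = sorted(alias_to_name, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, ordered)) + r")\b", re.IGNORECASE
    )

    def find(text):
        found = []
        for match in pattern.finditer(text):
            name = alias_to_name[match.group(1).lower()]
            if name not in found:
                found.append(name)
        return found

    return find


find_parties = build_matcher({
    "Example Holdings Limited": ["Example Holdings", "EHL"],
    "Sony Music": [],
})
print(find_parties("Agreement between EHL and Sony Music."))
# ['Example Holdings Limited', 'Sony Music']
```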

Search Methods
As with eDiscovery, cDiscovery requires a search engine to process
and present information to users. This allows users to search
based on either the full-text information within the contracts or
the proactively extracted contextual information.

Because of the proactive extraction of information, cDiscovery
solutions can present users with information without the users
knowing what they are looking for. An example of this type of
information management and presentation is the contracting party
and contract type.

Take the first example of “Sony Music”. This time, however, the user
is presented with a view that groups all contracts by type, say
Intellectual Property Sale and Sales Contracts. At this point the user
only needs to select the view, or faceted search view, to see only the
contracts of that type. Add to this the ability to search for
Contracting Parties of Sony Music, and the system presents the user
with accurate and targeted results, with the ability to view all the
extracted information within a single view.
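
The faceted view can be sketched over already extracted metadata. The
records, file names and the parties other than Sony Music are invented
for this example.

```python
from collections import defaultdict

# Hypothetical extracted metadata for three discovered contracts.
contracts = [
    {"file": "a.pdf", "type": "Intellectual Property Sale",
     "parties": ["Sony Music", "Example Ltd"]},
    {"file": "b.tif", "type": "Sales Contract",
     "parties": ["Example Ltd", "Sample GmbH"]},
    {"file": "c.pdf", "type": "Sales Contract",
     "parties": ["Sony Music", "Sample GmbH"]},
]


def facet_by(records, field):
    """Group file names by the value of one extracted field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record["file"])
    return dict(groups)


def filter_by_party(records, party):
    """Keep only records where `party` is a contracting party."""
    return [record for record in records if party in record["parties"]]


print(facet_by(contracts, "type"))
print(facet_by(filter_by_party(contracts, "Sony Music"), "type"))
# {'Intellectual Property Sale': ['a.pdf'], 'Sales Contract': ['c.pdf']}
```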

Search and eDiscovery limitations

The main challenges faced by Search and eDiscovery methods today
are lost metadata and formatting when documents are converted to
image-type files. Most, if not all, executed contracts are image
files, with historical data almost always being faxed versions of the
original signed contract.

As previously detailed, the loss of formatting and metadata causes
the applications to extract only streams of text, and then only if an
OCR process has been used. Even with the OCR process, little or no
error correction and information correlation is performed by the
eDiscovery and Search engines, so errors remain within the extracted
text.

With the loss of the metadata and the induced errors, accurate
discovery and classification of contracts becomes a significant
challenge, and one that current Search and eDiscovery engines cannot
meet.

Summary
It should be clear that within an effective contracts discovery process,
additional functions and methods are needed over and above what
Search and eDiscovery offer.

cDiscovery is a combination of Search, eDiscovery, complex document
processing and targeted logic functions, for the proactive extraction
and presentation of information within context. The Search and
eDiscovery processes provide information reactively, relying on
users’ knowledge and efforts to complete the processing. cDiscovery
provides proactive presentation, as well as warnings on pending
contractual obligations or milestones.

cDiscovery should be seen as a logical extension to any eDiscovery
process, as the information discovered and extracted can be utilised
by the eDiscovery engines. Further to this, standard web services
interfaces are provided within the Search, eDiscovery and cDiscovery
applications. Processing of the correct information can therefore
occur within the appropriate application, with information flowing
seamlessly between each function.

With new regulations and reporting rules, companies can no longer
ignore contractual information within their environments. No
Enterprise Search or eDiscovery engine is complete without the
complement of cDiscovery processing.

COMMERCIAL IN CONFIDENCE
© Copyright 2011. Seal Software Solutions Limited. All rights reserved.
The contents of this document are commercial in confidence and are not to be copied or supplied in
part or whole to third parties without the prior written consent of Seal Software Solutions Limited.
