Introducing cDiscovery
Why search and eDiscovery are inadequate when trying to locate
contracts in an enterprise
Whitepaper
Contents
           Background
                  Contractual Information Formats
                  Relational and Relevance
                  Automated Contract Recognition
                  Information Normalisation
                  High Risk / Value Clause Detection

           cDiscovery Extensions over Search and eDiscovery
                  OCR and spell checking
                  Table Recognition and Information Extraction
                  Signature Detection
                  Customer Specific Information

           Search Methods
                  Search and eDiscovery limitations

           Summary
May 2011




Background
Today’s enterprise holds many different document types across many
different information sources. Contracts are one such information
type, and they exist in a variety of document formats such as scanned
images, office files and PDFs.

Historically, contracts have been held within file shares as TIFF or
image-embedded PDF files originating from scanners or email
attachments, with limited metadata attached to them. Because
contracts need to be signed by both parties, the final contract would
be held either as the original paper document or, in most cases, as a
copy faxed back to the contracting parties.

Within each business unit and geographical location, different
contract management policies could have been deployed. This can
create a dispersed and highly irregular contract management
environment.

Adding to the complexity, each location or department could have
deployed and used many different contract templates, and could have
received hundreds of different inbound contract formats.

As many of the formats, layouts and information will be unknown,
searching for information becomes an arduous task. With every
additional contracting party, for example, the number of combinations
and “false positives” returned by standard eDiscovery or search
methods will increase.

A false positive is defined as “relating to or being an individual or
a test result that is erroneously classified in a positive category”.

Related directly back to search, this means you are given a result
that matches your query but is not a correct match.

A common misunderstanding regarding eDiscovery is that it provides
rich contextual information. Some eDiscovery solutions use digital
fingerprints, such as the NIST lists of known files, to remove
non-relevant data and so improve the relevancy of result sets. This
approach, however, is the direct inverse of the cDiscovery process,
which classifies information IN rather than filtering it out.
Additionally, eDiscovery uses search within its processing, and so
has the same functional limitations in relation to contractual
information.

The size of the issue is best illustrated with a simple example based
on the contracting parties of a contract.

Each contracting party could be a person, company, organisation,
country, department or any other combination of addresses and
entities. To effectively search, locate and review contracts based on
contracting parties alone, a user would need to know the parties
before starting the search, filter out the false positives (on
average 80% or more of the results), and then open and review each
remaining item for the correct data.

The reason the false positive rate can be so high lies in the search
itself. Take for example a contracting party of Sony Music. Searching
for this name alone would produce a result set consisting of every
item where Sony Music is mentioned. The search is not able to
differentiate between a simple document referring to a music track
held by Sony Music and a contract where Sony Music is one of the
contracting parties.

A simple test of this is to place the search “Sony music” +
“contracts” into Google. This results in over 1.2 million hits, with
very few, if any, being actual contracts. If we further refine the
search to “Sony music” + “contracting party”, the results are just
over 1,200 items. However, none of those items are actual contracts
where Sony Music is the contracting party.

While a Google search of the internet is not a direct representation
of internal customer environments, the challenges remain the same.
Many organisations already use an internal search engine or
eDiscovery, but these do not provide the functionality required to
meet the challenge of locating contractual information.

This simple illustration shows that to effectively search, discover
and manage contracts a new approach is required, targeted at the
information held within contracts. With this in mind, a new
technology and methodology is needed.

At this point in the whitepaper we introduce a new technology called
cDiscovery. This term will be used to refer to the discovery of
contracts throughout the rest of this document. Currently the only
cDiscovery solution on the market, and the basis for reference within
this document, is the Seal Software cDiscovery solution.

Contractual Information Formats

To further understand the importance of cDiscovery, it is necessary
to consider contractual formats and layouts. Contracts in many cases
are free-form information items, with dates, parties, clauses and
obligations distributed unpredictably throughout the documents.

With the majority of contracts being in image formats, Optical
Character Recognition (OCR) is required to extract information.
During this extraction process, errors can be introduced due to poor
image quality. For example, an “i” can become an “!”, which causes
“client” to become “cl!ent”.
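
As a minimal sketch of how a substitution error of this kind might be
repaired, the Python fragment below tries a small table of commonly
confused characters against a known vocabulary. The confusion table,
the vocabulary and the function name repair_token are illustrative
assumptions, not a description of how any cDiscovery product works.

```python
from itertools import product

# Hypothetical table: characters an OCR engine might emit by mistake,
# mapped to the characters they may stand for (including themselves).
CONFUSIONS = {
    "!": ["!", "i", "l"],
    "1": ["1", "l", "i"],
    "0": ["0", "o"],
}

# Hypothetical vocabulary of terms expected in contracts.
VOCABULARY = {"client", "contract", "licensor", "supplier"}


def repair_token(token, vocabulary=VOCABULARY):
    """Return a vocabulary word reachable by undoing confusions, else the token."""
    lowered = token.lower()
    if lowered in vocabulary:
        return lowered
    # Every character expands to its possible readings; try each combination.
    options = [CONFUSIONS.get(ch, [ch]) for ch in lowered]
    for candidate in product(*options):
        word = "".join(candidate)
        if word in vocabulary:
            return word
    return token


print(repair_token("cl!ent"))    # client
print(repair_token("c0ntract"))  # contract
```

A token that cannot be mapped to a known word is returned unchanged,
so the original extracted text is never lost.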

Further formats, such as images embedded within PDF files, also
require specific handling and processing. For example, when
processing a PDF file, does the system use a PDF IFilter or
equivalent, or does it process the item via an OCR engine because it
contains embedded images?

Not only must the document formats be considered; the layouts within
the documents must also be recognised. Take for example a contract
with a table detailing the contracting party, contract values and
jurisdiction. The relationship between the headings and cells needs
to be understood to enable effective processing and discovery. Simple
text extraction is not capable of producing a relational view or data
correlation between cells.




Relational and Relevance

One of the main advantages of cDiscovery over Search and eDiscovery
is its ability to determine the relational mapping and relevance
between items of information within the contract. To illustrate this
point, a contract or document can contain a location, say New York.
To determine the relevance and importance of this location, the
system must process the preceding and following terms, words, phrases
and sentences. Thus, when processing the location, the system must
first discover the location, investigate its context and then extract
the information if required.

The process of identifying the contract’s Jurisdiction is a good
example of relevance. The actual contract might have many different
locations, countries or states listed within it. Thus, to determine
the Jurisdiction, the Governing Law and Applicable Law all need to be
accounted for. This can only be done when the relevance and
relational positioning of the relevant terms is understood.
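
The idea of discovering a location and then investigating its context
can be sketched in a few lines of Python. This is only an
illustration: the location list, the cue phrases and the function
name find_jurisdiction are assumptions, and a real system would need
far richer context handling than a per-sentence check.

```python
import re

# Assumed list of locations the system already knows how to spot.
LOCATIONS = ["New York", "England and Wales", "California"]
# Assumed cue phrases that signal a governing-law context.
CUES = ["governed by", "governing law", "applicable law", "jurisdiction"]


def find_jurisdiction(text):
    """Return locations that share a sentence with a governing-law cue."""
    found = []
    for sentence in re.split(r"(?<=[.;])\s+", text):
        lowered = sentence.lower()
        if not any(cue in lowered for cue in CUES):
            continue  # a location here is just an address or a mention
        for location in LOCATIONS:
            if location.lower() in lowered and location not in found:
                found.append(location)
    return found


sample = (
    "The Supplier has offices in New York and California. "
    "This Agreement shall be governed by the laws of England and Wales."
)
print(find_jurisdiction(sample))  # ['England and Wales']
```

New York and California are both present in the sample text, but only
the location that appears alongside a governing-law cue is reported.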

While standard search engines are capable of determining locations
and presenting filters based on them, they don’t present the user
with information targeted at the relevant and relational level. In
many cases only the location itself is accounted for.

Relevance and relational information can be further illustrated by
returning to the first example, this time with cDiscovery as the
discovery engine instead of a Search or eDiscovery engine. The search
results now return only items where “Sony Music” is actually listed
as one of the contracting parties, thus reducing the amount of
“noise” and false positive results.
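
The difference between matching a mention and matching a contracting
party can be shown with a small Python sketch. The two sample
documents, the party “Example Ltd” and the single “between A and B”
pattern are hypothetical; real contracts introduce their parties in
many more ways than this.

```python
import re

# Hypothetical documents: one mention, one actual agreement.
documents = {
    "memo.txt": "The track is held by Sony Music and was released in 2009.",
    "agreement.txt": (
        "This Agreement is made between Sony Music (the Licensor) "
        "and Example Ltd (the Licensee)."
    ),
}

# Assumed pattern: parties are introduced as "between A (...) and B (...)".
PARTY_PATTERN = re.compile(
    r"between\s+(.+?)\s+\(.*?\)\s+and\s+(.+?)\s+\(", re.IGNORECASE
)


def keyword_search(term):
    """Plain search: any document that mentions the term."""
    return sorted(n for n, t in documents.items() if term.lower() in t.lower())


def party_search(term):
    """Role-aware search: only documents naming the term as a party."""
    hits = []
    for name, text in documents.items():
        match = PARTY_PATTERN.search(text)
        if match and term.lower() in [p.lower() for p in match.groups()]:
            hits.append(name)
    return sorted(hits)


print(keyword_search("Sony Music"))  # ['agreement.txt', 'memo.txt']
print(party_search("Sony Music"))    # ['agreement.txt']
```

The keyword search returns the memo as a false positive, while the
role-aware search returns only the agreement.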




Automated Contract Recognition
Even with the relevance and relational awareness detailed above,
further methods are needed to detect contracts and provide users with
a simple proactive view of the contractual information.

One area where this is important is the recognition of the contract
type. cDiscovery solutions present a graded level of confidence on
items classified as contracts. This is important as false positives
will still occur, though in greatly reduced numbers.

To present a graded confidence level on discovered contracts, the
system needs to extract the “type” of contract discovered. To do this
effectively, not only are the relational and relevance methods
needed, but the dynamic building of contract types is also required.

Take a simple example: a Non-Disclosure Agreement could be listed
within a contract as Non-Disclosure, Non-Disclosure Agreement, NDA,
Mutual NDA etc. There are many different wordings for the same
contract type. Thus, to ensure that the correct contract type is
applied, the contract types need to be built dynamically based on
wording, phrases and relational information. This is a significant
benefit of the cDiscovery methodology and application over standard
search and eDiscovery methods.
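
As a rough illustration of mapping several wordings onto one contract
type, the Python sketch below uses a fixed table of patterns. The
table and the function name detect_contract_type are assumptions made
for the example; the text above describes types being built
dynamically, which a static table does not attempt to do.

```python
import re

# Hypothetical variant table for a single canonical contract type.
TYPE_VARIANTS = {
    "Non-Disclosure Agreement": [
        r"mutual\s+nda",
        r"non[-\s]?disclosure(\s+agreement)?",
        r"\bnda\b",
    ],
}


def detect_contract_type(title):
    """Return the canonical type whose variants match the title, or None."""
    for canonical, patterns in TYPE_VARIANTS.items():
        for pattern in patterns:
            if re.search(pattern, title, re.IGNORECASE):
                return canonical
    return None


for title in ["Non-Disclosure", "NDA", "Mutual NDA", "Master Services Agreement"]:
    print(title, "->", detect_contract_type(title))
```

The first three titles resolve to the same canonical type, while the
last one is left without a type.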

Once a contract type has been identified, the confidence that the
item is actually a contract is at its highest level. Some contracts
will be extracted with no contract type but still contain relevant
contractual information. These are given the next highest relevance
scores. This leaves items that contain only some contractual matches,
for example a cover letter for a contract with details on start and
termination notices.




Information Normalisation

One area commonly overlooked and misunderstood within a search
process is information normalisation. Information normalisation is
the process of automatically determining the correct value when
ambiguity exists. This is most often seen when processing dates.

Date formats can be US English, UK English, European and many others,
with short, long and textual dates being used. An example of this is
01/06/10. Based on US and UK formats alone, this date can be the 6th
of January 2010 or the 1st of June 2010. Extended to word-based
dates, the UK reading could equally appear as “the first day of June
2010”. It is clear that normalisation needs to take place.
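
The ambiguity can be reproduced with Python’s standard datetime
module. This is a small sketch covering only the US and UK readings
discussed above; the dictionary of formats and the function name
interpretations are choices made for the example.

```python
from datetime import datetime

# The two conventions discussed in the text.
FORMATS = {
    "US (month/day/year)": "%m/%d/%y",
    "UK (day/month/year)": "%d/%m/%y",
}


def interpretations(raw):
    """Return every valid reading of a short numeric date."""
    results = {}
    for label, fmt in FORMATS.items():
        try:
            results[label] = datetime.strptime(raw, fmt).date()
        except ValueError:
            continue  # not a valid date under this convention
    return results


readings = interpretations("01/06/10")
for label, value in readings.items():
    print(label, value.isoformat())
# US (month/day/year) 2010-01-06
# UK (day/month/year) 2010-06-01

gap = abs(readings["UK (day/month/year)"] - readings["US (month/day/year)"])
print(gap.days)  # 146
```

The two readings are 146 days apart, which is the size of the error
if the wrong convention is assumed for a renewal or termination date.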

The normalisation process covers not only dates; it also covers
locations, people and companies, where short names or abbreviations
are used.

The cDiscovery process, unlike search engines, needs to understand
the relevance and context of dates and formats within contracts. It
also needs to normalise the information. Without normalisation you
could miss a renewal or termination date by almost five months,
referring to our example above.

This process of understanding the locale and relevance of the
information is a key differentiator between cDiscovery and Search or
eDiscovery methods.




High Risk / Value Clause Detection

Another benefit of cDiscovery is its ability to identify contractual
clauses or wordings that present risk or value to the organisation.
One such example is the “Assignment” clause within many contracts.
The contracting parties either have, or do not have, the right to
assign the contract during a sale, merger or outsourcing event.

Recognition of the risk within the clause also extends to
understanding dates and relative time periods. Take for example a
conditional assignment of a contract, where the main contracting
party is given the right to assign but must first provide 28 days’
written notice to all parties.
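
A relative time period such as this notice requirement can be turned
into a usable value with a short Python sketch. The pattern, the
sample clause wording, the planned assignment date and the function
name notice_period are all assumptions made for illustration.

```python
import re
from datetime import date, timedelta

# Assumed pattern for phrases like "28 days written notice".
NOTICE_PATTERN = re.compile(
    r"(\d+)\s+(day|week)s?['’]?\s+(?:prior\s+)?written\s+notice",
    re.IGNORECASE,
)


def notice_period(clause):
    """Return the notice period found in a clause as a timedelta, or None."""
    match = NOTICE_PATTERN.search(clause)
    if match is None:
        return None
    amount = int(match.group(1))
    unit = match.group(2).lower()
    return timedelta(days=amount) if unit == "day" else timedelta(weeks=amount)


clause = (
    "The Customer may assign this Agreement provided that it first gives "
    "28 days written notice to all parties."
)
period = notice_period(clause)
print(period.days)  # 28

# Hypothetical planned assignment date: notice must be served by this day.
planned_assignment = date(2011, 6, 30)
print(planned_assignment - period)  # 2011-06-02
```

Once the duration is normalised, it can be combined with other
extracted dates to work out when an action has to be taken.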

Again, the detection of relevance, proximity and durations, and the
normalisation of values, is required. Thus, to understand the
inherent risk or value of a clause or body of text, the cDiscovery
solution must correlate multiple values.

The ability to quickly extend and tailor the detection and extraction
of key contextual metadata is also a critical aspect of the
cDiscovery process. An item of value to one company can be seen as
high risk to another. Thus a cDiscovery solution must have the
ability to “learn”, so that it can be tailored to customers’ needs
based on “teaching”. This iterative process improves and refines
cDiscovery’s overall accuracy and precision.

Search and standard eDiscovery methods are not well positioned to
provide this level of correlation. They are designed to provide fast
access to result sets over millions of documents, but leave the
correlation and understanding to the user.




cDiscovery Extensions over Search and eDiscovery
The preceding sections have referred to cDiscovery functionality and
how it differs from that of a standard search solution. To further
understand what cDiscovery provides over and above search and
eDiscovery, some key functional extensions are described in this
section.



OCR and spell checking

As many, if not all, contracts are held as image files (TIFF, GIF,
PDF etc.) or contain embedded images, OCR processing and information
capture needs to be performed. During this process, the quality of
the scanned images can introduce noise and errors within the text.
These errors could, if left unmanaged, cause contracts to be missed
during the discovery phase.

To counter and, where possible, eliminate errors of this nature,
spell checking and intelligent processing need to be performed.
Intelligent processing of spelling mistakes is where the application
again looks to the surrounding wordings and phrases to determine the
best contextual and relevant replacement for an incorrectly spelled
word.

While some search engines do provide spelling suggestions and
corrections, these are based primarily on the Levenshtein distance or
dictionary-based lookups of common words. While this might work for
searching and eDiscovery methods alone, it does not provide the
required relevance and proximity calculations. This method also
targets typing errors made by users in their queries, rather than
correcting errors at the source of extraction.
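
For reference, the Levenshtein distance counts the single-character
insertions, deletions and substitutions needed to turn one string
into another. The Python sketch below is a standard dynamic
programming implementation; the OCR-damaged token “c!ause” is a
made-up example chosen to show why distance alone can be ambiguous.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


print(levenshtein("cl!ent", "client"))  # 1
print(levenshtein("c!ause", "clause"))  # 1
print(levenshtein("c!ause", "cause"))   # 1
```

The token “c!ause” is one edit away from both “clause” and “cause”,
so the distance cannot choose between them; only the surrounding
words can indicate which replacement is correct.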




Table Recognition and Information Extraction

With eDiscovery and Search being targeted at finding as much
information as possible and leaving the processing to the users,
formatting and tabular data is not processed in context. While this
is acceptable for search and eDiscovery tasks, tabular information
must be accounted for when dealing with relational contractual data.

Take for example a pricing structure that is based on a table, with
dates, items and values, including a total contractual value, within
the cells. In most if not all cases, eDiscovery and search engines
will extract all the headings followed by all the data as a single
stream of text. This removes any relationship between the headings,
cells and columns, making it impossible to determine the context and
relevance of the information.
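
The loss of structure can be seen in a small Python sketch. The
pricing table below is invented for the example, as is the helper
value_of; the point is only to contrast a flattened stream of text
with records that keep each cell attached to its heading.

```python
# Hypothetical pricing table as it might appear in a contract.
headings = ["Date", "Item", "Value"]
rows = [
    ["01 March 2011", "Licence fee", "10,000"],
    ["01 April 2011", "Support", "2,500"],
    ["", "Total", "12,500"],
]

# Plain text extraction: headings, then every cell, as one stream.
flat_text = " ".join(headings + [cell for row in rows for cell in row])

# Relational view: each cell stays linked to its column heading.
records = [dict(zip(headings, row)) for row in rows]


def value_of(item):
    """Look up the value recorded against a named item."""
    for record in records:
        if record["Item"] == item:
            return record["Value"]
    return None


print(flat_text)
print(value_of("Total"))  # 12,500
```

In the flattened text nothing indicates which of the three numbers is
the total contractual value, whereas the relational view answers the
question directly.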

As cDiscovery relies on being able to process information within
context, it is imperative that it maintains the linkage between
items. It must therefore be able to process tabular information
within tables, which can become challenging when dealing with
image-based items.



Signature Detection

To further reduce the possibility of false positive results and to
target signed contracts, the capability to detect a possible
signature within the contract should be available. With this
detection, users can be presented with a targeted set of items that
have a very high confidence level of being contracts that have
actually been entered into.

Combining this with the extracted information, such as termination
and renewal dates or notice periods, a risk and value matrix can be
quickly determined.



Customer Specific Information

A further extension that the cDiscovery solution provides is the
ability for the application to “learn” about the environment it has
been installed into. In much the same way as a child learns from
examples and reference information, the cDiscovery solution should be
able to learn as well. It should not only be able to recognise, for
example, a simple list of companies or people known to the
organisation; it should also be able to quickly incorporate this
information into the processing algorithms to improve accuracy and
the extraction of relevant information.




Search Methods
As with eDiscovery, cDiscovery requires a search engine to process
and present information to users. This allows users to search based
on the full text of the contracts or on the proactively extracted
contextual information.

Because of the proactive extraction of information, cDiscovery
solutions can present users with information without the users
knowing what they are looking for. An example of this type of
information management and presentation is the contracting party and
contract type.

Take the first example of “Sony Music”. This time, however, the user
is presented with a view that groups all contracts by type, say
Intellectual Property Sale and Sales Contracts. At this point the
user only needs to select the relevant group in the faceted search
view to see only the contracts of that type. Add to this the ability
to search for Contracting Parties of Sony Music, and the system
presents the user with accurate and targeted results, with the
ability to view all the extracted information within a single view.
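
Grouping by contract type and then narrowing by contracting party can
be sketched in Python as follows. The three contract records, the
parties “Example Ltd” and “Sample Inc”, and both function names are
hypothetical; they stand in for metadata that would already have been
extracted.

```python
from collections import defaultdict

# Hypothetical metadata extracted from three contracts.
contracts = [
    {"id": 1, "type": "Intellectual Property Sale",
     "parties": ["Sony Music", "Example Ltd"]},
    {"id": 2, "type": "Sales Contract",
     "parties": ["Sony Music", "Sample Inc"]},
    {"id": 3, "type": "Sales Contract",
     "parties": ["Example Ltd", "Sample Inc"]},
]


def facet_by_type(items):
    """Group contract ids under their contract type."""
    facets = defaultdict(list)
    for item in items:
        facets[item["type"]].append(item["id"])
    return dict(facets)


def filter_by_party(items, party):
    """Keep only contracts where the party is a contracting party."""
    return [item for item in items if party in item["parties"]]


print(facet_by_type(contracts))
# {'Intellectual Property Sale': [1], 'Sales Contract': [2, 3]}
print(facet_by_type(filter_by_party(contracts, "Sony Music")))
# {'Intellectual Property Sale': [1], 'Sales Contract': [2]}
```

The user never types a query for the contract type; the groups are
offered up front and the party filter simply narrows them.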



Search and eDiscovery limitations

The main challenges faced by Search and eDiscovery methods today are
the metadata and formatting lost when documents are converted to
image-type files. Most if not all contracts that have been entered
into are image files, with historical data almost always being faxed
versions of the original signed contract.

As previously detailed, the loss of formatting and metadata causes
the applications to extract only streams of text, and then only if an
OCR process has been used. Even with the OCR process, little or no
error correction or information correlation is performed by the
eDiscovery and Search engines, so errors introduced during OCR remain
within the extracted text.

With the loss of the metadata and the induced errors, accurate
discovery and classification of contracts becomes a significant
challenge, and one that current Search and eDiscovery engines cannot
meet.




Summary
It should be clear that an effective contract discovery process needs
additional functions and methods over and above what Search and
eDiscovery offer.

cDiscovery is a combination of Search, eDiscovery, complex document
processing and targeted logic functions, for the proactive extraction
and presentation of information within context. The Search and
eDiscovery processes provide information reactively, relying on
users’ knowledge and efforts to complete the processing. cDiscovery
provides proactive presentation, as well as warnings on pending
contractual obligations or milestones.

cDiscovery should be seen as a logical extension to any eDiscovery
process, as the information discovered and extracted can be utilised
by the eDiscovery engines. Further to this, standard web services
interfaces are provided within the Search, eDiscovery and cDiscovery
applications. Each piece of information can therefore be processed
within the appropriate application, with information flowing
seamlessly between each function.

With new regulations and reporting rules, companies can no longer
ignore contractual information within their environments. No
Enterprise Search or eDiscovery engine is complete without the
complement of cDiscovery processing.




COMMERCIAL IN CONFIDENCE
© Copyright 2011. Seal Software Solutions Limited. All rights reserved.
The contents of this document are commercial in confidence and are
not to be copied or supplied in part or whole to third parties
without the prior written consent of Seal Software Solutions Limited.

More Related Content

Similar to Whitepaper Introducing C Discovery

Over 70% of all acquisitions never realize their potential
Over 70% of all acquisitions never realize their potentialOver 70% of all acquisitions never realize their potential
Over 70% of all acquisitions never realize their potentialTom Rieger
 
How AI is changing legal due diligence
How AI is changing legal due diligenceHow AI is changing legal due diligence
How AI is changing legal due diligenceImprima
 
Imprima | How AI is Changing Legal Due Diligence
Imprima | How AI is Changing Legal Due DiligenceImprima | How AI is Changing Legal Due Diligence
Imprima | How AI is Changing Legal Due DiligenceImprima
 
GDPR READY SOLUTION FOR UNSTRUCTURED DATA
GDPR READY SOLUTION FOR UNSTRUCTURED DATAGDPR READY SOLUTION FOR UNSTRUCTURED DATA
GDPR READY SOLUTION FOR UNSTRUCTURED DATAXeniT Solutions nv
 
What is the Deal with Intent Data?
What is the Deal with Intent Data?What is the Deal with Intent Data?
What is the Deal with Intent Data?Infer
 
Evisort New Document Analyzer Offers Out-of-the-Box AI to Mine All A Company’...
Evisort New Document Analyzer Offers Out-of-the-Box AI to Mine All A Company’...Evisort New Document Analyzer Offers Out-of-the-Box AI to Mine All A Company’...
Evisort New Document Analyzer Offers Out-of-the-Box AI to Mine All A Company’...Evisort
 
Seal Datasheet | M&A Process
Seal Datasheet | M&A ProcessSeal Datasheet | M&A Process
Seal Datasheet | M&A Processsealsoftwaredept
 
Are you prepared for eu gdpr indirect identifiers? what are indirect identifi...
Are you prepared for eu gdpr indirect identifiers? what are indirect identifi...Are you prepared for eu gdpr indirect identifiers? what are indirect identifi...
Are you prepared for eu gdpr indirect identifiers? what are indirect identifi...Steven Meister
 
Why We Are Open Sourcing ContraxSuite and Some Thoughts About Legal Tech and ...
Why We Are Open Sourcing ContraxSuite and Some Thoughts About Legal Tech and ...Why We Are Open Sourcing ContraxSuite and Some Thoughts About Legal Tech and ...
Why We Are Open Sourcing ContraxSuite and Some Thoughts About Legal Tech and ...Daniel Katz
 
Seal Datasheet - Contract Abstraction
Seal Datasheet - Contract AbstractionSeal Datasheet - Contract Abstraction
Seal Datasheet - Contract Abstractionsealsoftwaredept
 
What is a Contract Repository.pdf
What is a Contract Repository.pdfWhat is a Contract Repository.pdf
What is a Contract Repository.pdfSirion Labs
 
En ebook-digital-signature-for-the-remote-workplace
En ebook-digital-signature-for-the-remote-workplaceEn ebook-digital-signature-for-the-remote-workplace
En ebook-digital-signature-for-the-remote-workplaceNiranjanaDhumal
 
A Guide to IT Consulting- Business.com
A Guide to IT Consulting- Business.comA Guide to IT Consulting- Business.com
A Guide to IT Consulting- Business.comBusiness.com
 
SirionLabs Webinar Featuring Forrester - Plugging Value Leakage in IT Outsour...
SirionLabs Webinar Featuring Forrester - Plugging Value Leakage in IT Outsour...SirionLabs Webinar Featuring Forrester - Plugging Value Leakage in IT Outsour...
SirionLabs Webinar Featuring Forrester - Plugging Value Leakage in IT Outsour...SirionLabs
 
ZyLAB White Paper - Bringing e-Discovery In-house
ZyLAB White Paper - Bringing e-Discovery In-houseZyLAB White Paper - Bringing e-Discovery In-house
ZyLAB White Paper - Bringing e-Discovery In-houseZyLAB
 
Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...
Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...
Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...Findwise
 
Susan Maples - PWYP Montreal Conference 2009
Susan Maples - PWYP Montreal Conference 2009Susan Maples - PWYP Montreal Conference 2009
Susan Maples - PWYP Montreal Conference 2009Publish What You Pay
 
Convergence Article. ACC Docket June 2015
Convergence Article. ACC Docket June 2015Convergence Article. ACC Docket June 2015
Convergence Article. ACC Docket June 2015Joseph Perkins
 
Discussion Task #1 Research· Scan and analyze the infogra.docx
Discussion Task        #1 Research· Scan and analyze the infogra.docxDiscussion Task        #1 Research· Scan and analyze the infogra.docx
Discussion Task #1 Research· Scan and analyze the infogra.docxpauline234567
 
Contract Management and Technology - seal software
Contract Management and Technology - seal softwareContract Management and Technology - seal software
Contract Management and Technology - seal softwaresealsoftwaredept
 

Similar to Whitepaper Introducing C Discovery (20)

Over 70% of all acquisitions never realize their potential
Over 70% of all acquisitions never realize their potentialOver 70% of all acquisitions never realize their potential
Over 70% of all acquisitions never realize their potential
 
How AI is changing legal due diligence
How AI is changing legal due diligenceHow AI is changing legal due diligence
How AI is changing legal due diligence
 
Imprima | How AI is Changing Legal Due Diligence
Imprima | How AI is Changing Legal Due DiligenceImprima | How AI is Changing Legal Due Diligence
Imprima | How AI is Changing Legal Due Diligence
 
GDPR READY SOLUTION FOR UNSTRUCTURED DATA
GDPR READY SOLUTION FOR UNSTRUCTURED DATAGDPR READY SOLUTION FOR UNSTRUCTURED DATA
GDPR READY SOLUTION FOR UNSTRUCTURED DATA
 
What is the Deal with Intent Data?
What is the Deal with Intent Data?What is the Deal with Intent Data?
What is the Deal with Intent Data?
 
Evisort New Document Analyzer Offers Out-of-the-Box AI to Mine All A Company’...
Evisort New Document Analyzer Offers Out-of-the-Box AI to Mine All A Company’...Evisort New Document Analyzer Offers Out-of-the-Box AI to Mine All A Company’...
Evisort New Document Analyzer Offers Out-of-the-Box AI to Mine All A Company’...
 
Seal Datasheet | M&A Process
Seal Datasheet | M&A ProcessSeal Datasheet | M&A Process
Seal Datasheet | M&A Process
 
Are you prepared for eu gdpr indirect identifiers? what are indirect identifi...
Are you prepared for eu gdpr indirect identifiers? what are indirect identifi...Are you prepared for eu gdpr indirect identifiers? what are indirect identifi...
Are you prepared for eu gdpr indirect identifiers? what are indirect identifi...
 
Why We Are Open Sourcing ContraxSuite and Some Thoughts About Legal Tech and ...
Why We Are Open Sourcing ContraxSuite and Some Thoughts About Legal Tech and ...Why We Are Open Sourcing ContraxSuite and Some Thoughts About Legal Tech and ...
Why We Are Open Sourcing ContraxSuite and Some Thoughts About Legal Tech and ...
 
Seal Datasheet - Contract Abstraction
Seal Datasheet - Contract AbstractionSeal Datasheet - Contract Abstraction
Seal Datasheet - Contract Abstraction
 
What is a Contract Repository.pdf
What is a Contract Repository.pdfWhat is a Contract Repository.pdf
What is a Contract Repository.pdf
 
En ebook-digital-signature-for-the-remote-workplace
En ebook-digital-signature-for-the-remote-workplaceEn ebook-digital-signature-for-the-remote-workplace
En ebook-digital-signature-for-the-remote-workplace
 
A Guide to IT Consulting- Business.com
A Guide to IT Consulting- Business.comA Guide to IT Consulting- Business.com
A Guide to IT Consulting- Business.com
 
SirionLabs Webinar Featuring Forrester - Plugging Value Leakage in IT Outsour...
SirionLabs Webinar Featuring Forrester - Plugging Value Leakage in IT Outsour...SirionLabs Webinar Featuring Forrester - Plugging Value Leakage in IT Outsour...
SirionLabs Webinar Featuring Forrester - Plugging Value Leakage in IT Outsour...
 
ZyLAB White Paper - Bringing e-Discovery In-house
ZyLAB White Paper - Bringing e-Discovery In-houseZyLAB White Paper - Bringing e-Discovery In-house
ZyLAB White Paper - Bringing e-Discovery In-house
 
Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...
Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...
Enterprise Search White Paper: Beyond the Enterprise Data Warehouse - The Eme...
 
Susan Maples - PWYP Montreal Conference 2009
Susan Maples - PWYP Montreal Conference 2009Susan Maples - PWYP Montreal Conference 2009
Susan Maples - PWYP Montreal Conference 2009
 
Convergence Article. ACC Docket June 2015
Convergence Article. ACC Docket June 2015Convergence Article. ACC Docket June 2015
Convergence Article. ACC Docket June 2015
 
Discussion Task #1 Research· Scan and analyze the infogra.docx
Discussion Task        #1 Research· Scan and analyze the infogra.docxDiscussion Task        #1 Research· Scan and analyze the infogra.docx
Discussion Task #1 Research· Scan and analyze the infogra.docx
 
Contract Management and Technology - seal software
Contract Management and Technology - seal softwareContract Management and Technology - seal software
Contract Management and Technology - seal software
 

Whitepaper Introducing C Discovery

  • 1. Introducing cDiscovery Why search and eDiscovery are inadequate when trying to locate contracts in an enterprise 1 Whitepaper
  • 2. Contents Background 3 Contractual Information Formats 5 Relational and Relevance 6 Automated Contract Recognition 7 Information Normalisation 8 High Risk / Value Clause Detection 9 cDiscovery Extensions over Search and eDiscovery 10 OCR and spell checking 10 Table Recognition and Information Extraction 11 Signature Detection 11 Customer Specific Information 12 Search Methods 13 Search and eDiscovery limitations 13 Summary 14 May 2011 2
  • 3. Background: Within today’s enterprise, many different document types are held within many differing information sources. One such information type is contracts, which come in differing document formats such as scanned images, office files and PDFs. Historically, contracts have been held within file shares, with limited metadata attached to them, as TIFF or image-embedded PDF files originating from scanners or email attachments. Due to the nature of contractual information, contracts need to be signed by both parties, and as such the final contract would be held as either the original paper document or, in most cases, a copy faxed back to the contracting parties. Within each business unit and geographical location, various contract management policies could have been deployed. This can create a dispersed and highly irregular contract management environment. Adding to the complexity, each location or department could have deployed and used many differing contract templates and received hundreds of various inbound contract formats. As many of the formats, layouts and information will be unknown, searching for information becomes an arduous task. With every different contracting party, for example, the number of combinations and “false positives” will increase when standard eDiscovery or search methods are used. A false positive is defined as “relating to or being an individual or a test result that is erroneously classified in a positive category”. Relating this directly back to search: you are given a result that matches your query, but it is not a correct match. A common misunderstanding regarding eDiscovery is that it provides rich contextual information. Some eDiscovery solutions use digital fingerprints (NIST lists) to remove non-relevant data and so improve the relevancy of result sets. This approach, however, is the direct inverse of the cDiscovery process, which classifies information in rather than filtering it out. Additionally, eDiscovery uses search within its processing, with the same functional limitations relating to contractual information.
  • 4. The size of the issue is best illustrated with a simple example based on the contracting parties of a contract. Each contracting party could be a person, company, organisation, country, department or any other combination of addresses and entities. To effectively search, locate and review contracts based on contracting parties alone, a user would need to know the parties before starting the search, filter out the false positives (on average 80% or more of the results), then open and review each remaining item for the correct data. The reason the false positive rate can be so high lies in the search itself. Take for example a contracting party of Sony Music. Searching for this alone would produce a result set consisting of every item where Sony Music is mentioned. The search is not able to differentiate between a simple document referring to a music track held by Sony Music and a contract where Sony Music is one of the contracting parties. A simple test of this is to place the following search into Google: “Sony music” + “contracts”. This will result in over 1.2 million hits, with very few, if any, being actual contracts. If we further refine the search with “Sony music” + “contracting party”, the results are just over 1200 items. However, none of those items are actual contracts where Sony Music is the contracting party. While a Google search of the internet is not a direct representation of internal customer environments, the challenges remain the same. Many organisations already use an internal search engine or eDiscovery, but these do not provide the functionality required to meet the challenge of locating contractual information. This simple illustration shows that to effectively search, discover and manage contracts a new approach is required, targeted at the information held within contracts. With this in mind, a new technology and methodology is needed. At this point in the whitepaper we introduce a new technology called cDiscovery. This term will be used to refer to the discovery of contracts throughout the rest of this document. Currently the only cDiscovery solution on the market, and the basis for reference within this document, is the Seal Software cDiscovery solution.
  • 5. Contractual Information Formats: To further understand cDiscovery’s importance, it is necessary to consider contractual formats and layouts. Contracts in many cases can be free-form information items, with dates, parties, clauses and obligations randomly distributed within the documents. With the majority of contracts being in image formats, Optical Character Recognition (OCR) is required to extract information. During this extraction process, poor image quality can introduce errors: for example, an “i” becomes an “!”, causing “client” to become “cl!ent”. Further formats, such as images embedded within PDF files, also require specific handling and processing. For example, when processing PDF files, does the system use a PDF IFilter or equivalent, or does it process the items via an OCR engine because they contain embedded images? Not only are the actual document formats to be considered; the layouts within the documents must also be recognised. Take for example a contract with a table detailing the contract party, contract values and jurisdiction. The relation of the headings and cells needs to be understood to enable effective processing and discovery. Simple text extraction is not capable of producing a relational view or data correlation between cells.
  • 6. Relational and Relevance: One of the main advantages of cDiscovery over Search and eDiscovery is its ability to determine the relational mapping and relevance between information within the contract. To illustrate this point, a contract or document can contain a location, say New York. To determine the relevance and importance of this location, the system must process the preceding and following terms, words, phrases and sentences. Thus, when processing the location, the system must first discover the location, investigate its context and then extract the information if required. The process of identifying the contract’s Jurisdiction is a good example of relevance. The actual contract might have many differing locations, countries or states listed within it. Thus, to determine the Jurisdiction, the Governing Law and Applicable Law all need to be accounted for. This can only be done when the relevance and relational positioning of the relevant terms are understood. While standard search engines are capable of determining locations and presenting filters based on them, they don’t present the user with information targeted at the relevant and relational level. In many cases only the location is accounted for. Relevance and relational information can be further illustrated by applying them to the first example. Let’s take cDiscovery as the discovery engine instead of a Search or eDiscovery engine. The search results now return only items where “Sony Music” is actually listed as a contracting party, thus reducing the amount of “noise” and the number of false positive results.
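The idea of inspecting the words around a location can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any cDiscovery product: the cue phrases, the pattern names and the `find_jurisdiction` helper are all assumptions made for the example, and a real system would need far broader language coverage.

```python
import re

# Hypothetical cue phrases that signal a governing-law or jurisdiction clause.
CUE = r"(?:governed by the laws? of|jurisdiction of the courts of)"
# A capitalised place name of one or more words, e.g. "New York".
PLACE = r"([A-Z][A-Za-z]+(?:\s[A-Z][A-Za-z]+)*)"
JURISDICTION_RE = re.compile(CUE + r"\s+(?:the\s+)?(?:State of\s+)?" + PLACE)


def find_jurisdiction(text):
    """Return a location only when it directly follows a jurisdiction cue."""
    match = JURISDICTION_RE.search(text)
    return match.group(1) if match else None


sample = (
    "Notices shall be sent to our London office. This Agreement shall be "
    "governed by the laws of the State of New York."
)
print(find_jurisdiction(sample))
```

A plain keyword search would treat London and New York alike; here only the location that follows a cue phrase is returned, and a sentence that merely mentions New York yields nothing.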
  • 7. Automated Contract Recognition: Even with the relevance and relational awareness detailed above, further methods are needed to detect contracts and provide users with a simple proactive view of the contractual information. One area where this is important is the actual recognition of the contract type. cDiscovery solutions present a graded level of confidence on items classified as contracts. This is important as false positives will still occur, though in greatly reduced numbers. To present a graded confidence level on discovered contracts, the system needs to extract the “type” of contract discovered. To do this effectively, not only are the relational and relevance methods needed; contract types must also be built dynamically. Take a simple example: a Non-Disclosure Agreement could be listed within a contract as Non-Disclosure, Non-Disclosure Agreement, NDA, Mutual NDA etc. There are many differing combinations for the same contract type. Thus, to ensure that the correct contract type is applied, contract types need to be built dynamically based on wording, phrases and relational information. This is a significant benefit of the cDiscovery methodology and application over standard search and eDiscovery methods. Once a contract type has been identified, the confidence that the item is actually a contract is at its highest level. There are contracts that will be extracted with no contract type, but which contain relevant contractual information. These are given the next highest relevance scores, leaving items that contain only some contractual matches, for example a cover letter for a contract with details on start and termination notices.
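As a rough illustration of mapping the many wordings of one contract type onto a single canonical type, the following Python sketch uses a small hand-written alias table. This is a deliberate simplification: the whitepaper describes contract types being built dynamically, whereas the table here is static, and the names `CONTRACT_TYPE_PATTERNS` and `detect_contract_type` are hypothetical.

```python
import re

# Hypothetical alias table: canonical contract type -> patterns for its variants.
CONTRACT_TYPE_PATTERNS = {
    "Non-Disclosure Agreement": [
        r"\bmutual\s+nda\b",
        r"\bnda\b",
        r"\bnon[-\s]?disclosure(\s+agreement)?\b",
    ],
}


def detect_contract_type(text):
    """Return the canonical type whose alias appears in the text, or None."""
    lowered = text.lower()
    for canonical, patterns in CONTRACT_TYPE_PATTERNS.items():
        for pattern in patterns:
            if re.search(pattern, lowered):
                return canonical
    return None


for heading in ["Mutual NDA", "Non-Disclosure", "Invoice for a music track"]:
    print(heading, "->", detect_contract_type(heading))
```

An item that yields a type would sit in the highest confidence band described above; an item that yields `None` would need to be graded on other extracted contractual information.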
  • 8. Information Normalisation: One area commonly overlooked and misunderstood within a search process is information normalisation. Information normalisation is the process of automatically determining the correct value when ambiguity exists. This is most often seen when processing dates. Date formats can be US English, UK English, European and many others, with short, long and textual dates being used. An example of this is 01/06/10. Based on US and UK formats alone, this date can be the 6th of January 2010 or the 1st of June 2010. If this is further extrapolated to word-based dates, the same date could also appear as the First day of June 2010. It is clear that normalisation needs to take place. The normalisation process covers not only dates; it also covers locations, people and companies, where short names or abbreviations are used. The cDiscovery process, unlike search engines, needs to understand the relevance and context of dates and formats within contracts. It also needs to normalise the information. Without this you could miss a renewal or termination date by almost five months, referring to our example above. This process of understanding the locale and relevance of the information is a key differentiator between cDiscovery and Search or eDiscovery methods.
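The date ambiguity described above can be reproduced with Python’s standard library. The sketch below is illustrative only: the `LOCALE_FORMATS` table and both function names are assumptions made for the example, and a real system would have to infer the locale from the surrounding contract text rather than be told it.

```python
from datetime import date, datetime

# Assumed short-date conventions for two locales.
LOCALE_FORMATS = {"US": "%m/%d/%y", "UK": "%d/%m/%y"}


def normalise_date(raw, locale):
    """Parse a short numeric date under one locale and return it as ISO 8601."""
    return datetime.strptime(raw, LOCALE_FORMATS[locale]).date().isoformat()


def candidate_dates(raw):
    """Return every valid interpretation of a possibly ambiguous date."""
    results = {}
    for locale in LOCALE_FORMATS:
        try:
            results[locale] = normalise_date(raw, locale)
        except ValueError:
            pass  # not a valid date under this locale's convention
    return results


print(candidate_dates("01/06/10"))
# Gap between the two readings of the same string, in days.
print((date(2010, 6, 1) - date(2010, 1, 6)).days)
```

For `01/06/10` both locales produce a valid date, and the two readings are 146 days apart, which is why the surrounding context has to decide between them. A string such as `25/12/10` is valid under only one of the two conventions, so it resolves without further context.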
  • 9. High Risk / Value Clause Detection: Another benefit of cDiscovery is its ability to identify contractual clauses or wordings that present risk or value to the organisation. One such example is the “Assignment” clause within many contracts. The contracting parties either have, or don’t have, the right to assign the contract during a sale, merger or outsourcing event. Recognition of the risk within the clause also extends to understanding dates and relative time periods. Take for example a conditional assignment of a contract, where the main contracting party is given the right to assign but must first provide 28 days’ written notice to all parties. Again, the detection of relevance, proximity and durations, and the normalisation of values, are required. Thus, to understand the inherent risk or value of a clause or body of text, the cDiscovery solution must correlate multiple values. The ability to quickly extend and tailor the detection and extraction of key contextual metadata is also a critical aspect of the cDiscovery process. An item of value to one company can be seen as high risk to another. Thus a cDiscovery solution must have the ability to “learn”, so that it can be tailored to customers’ needs based on “teaching”. This iterative process improves and refines cDiscovery’s overall accuracy and precision. Search and standard eDiscovery methods are not well positioned to provide this level of correlation. They are designed to provide fast access to result sets over millions of documents, but leave the correlation and understanding to the user.
  • 10. cDiscovery Extensions over Search and eDiscovery: Within the preceding sections, reference has been made to cDiscovery functionality and how this differs from a standard search solution. To further understand what cDiscovery provides over and above search and eDiscovery, some key functional extensions are described here. OCR and spell checking: As many, if not all, contracts are held as image files (TIFF, GIF, image-based PDF etc.), OCR processing and information capture need to be performed. During this process, the quality of the scanned images can introduce noise and errors within the text. The errors introduced could, if left unmanaged, cause contracts to be missed during the discovery phase. To counter and, where possible, eliminate errors of this nature, spell checking and intelligent processing need to be performed. Intelligent processing of spelling mistakes is where the application again looks to the surrounding wordings and phrases to determine the best contextual and relevant replacement for an incorrectly spelled word. While some search engines do provide spelling suggestions and corrections, these are based primarily on the Levenshtein distance or dictionary-based lookups of common words. While this might work for searching and eDiscovery methods alone, it does not provide the required relevance and proximity calculations. This method also targets users’ typing errors in queries, rather than correcting errors at the source of extraction.
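For reference, the Levenshtein distance mentioned above counts the single-character insertions, deletions and substitutions needed to turn one string into another. The Python sketch below applies it to the earlier OCR error “cl!ent”. The vocabulary and the `closest_term` helper are hypothetical, and the lookup is exactly the context-free, dictionary-style correction the whitepaper describes as insufficient on its own: it ignores the surrounding words entirely.

```python
def levenshtein(a, b):
    """Edit distance between a and b using the classic row-by-row recurrence."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def closest_term(token, vocabulary, max_distance=1):
    """Replace token with the nearest vocabulary term if it is close enough."""
    best = min(vocabulary, key=lambda term: levenshtein(token, term))
    return best if levenshtein(token, best) <= max_distance else token


# Hypothetical vocabulary of contract terms.
vocabulary = ["client", "clause", "contract"]
print(closest_term("cl!ent", vocabulary))
```

Because “cl!ent” is one substitution away from “client”, the lookup repairs it. A token that is equally close to two vocabulary terms would be resolved arbitrarily, which is where the contextual processing described above would be needed.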
  • 11. Table Recognition and Information Extraction: With eDiscovery and Search being targeted at finding as much information as possible and leaving the processing to the users, formatting and tabular data are not processed in context. While this is acceptable for search and eDiscovery tasks, tabular information must be accounted for when dealing with relational contractual data. Take for example a pricing structure that is based on a table, with dates, items and values, including a total contractual value, within the cells. In most if not all cases, eDiscovery and search engines will extract all the headings followed by all the data as a single stream of text. This removes any relationship between the headings, cells and columns, making it impossible to determine the context and relevance of the information. As cDiscovery relies on being able to process information within context, it is imperative that it maintains the linkage between items. It is therefore required to process tabular information within the tables, which can become challenging when dealing with image-based items. Signature Detection: To further reduce the possibility of false positive results and to target signed contracts, the capability to detect a possible signature within the contract should be available. With this detection, users can be presented with a targeted set of items that have a very high confidence level of being contracts that were actually entered into. Combining this with the extracted information, such as termination and renewal dates or notice periods, a risk and value matrix can be quickly determined.
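The difference between a flat text stream and a table whose structure is preserved can be shown with a short Python sketch. The pricing table, its values and both function names are invented for illustration, and the sketch assumes the table has already been recognised as rows of cells, which is the hard part for image-based documents.

```python
# Hypothetical pricing table as recognised rows of cells; first row is the header.
pricing_table = [
    ["Date", "Item", "Value"],
    ["2010-01-06", "Licence", "1000"],
    ["2010-06-01", "Support", "250"],
]


def flatten(table):
    """Headings followed by all the data as a single stream of text."""
    return " ".join(cell for row in table for cell in row)


def rows_to_records(table):
    """Keep the linkage between each cell and its column heading."""
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


print(flatten(pricing_table))
print(rows_to_records(pricing_table)[1]["Value"])
```

In the flattened stream nothing ties `250` to the Value column or to the Support row, whereas each record keeps every cell attached to its heading, so a question such as “what is the value of the Support item?” can be answered directly.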
  • 12. Customer Specific Information: A further extension that the cDiscovery solution provides is the ability to allow the application to “learn” about the environment it has been installed into. In much the same way as a child learns by examples and reference information, the cDiscovery solution should be able to learn as well. For example, it should not only be able to recognise a simple list of, say, companies or people known to the organisation; it should also be able to quickly incorporate this information into the processing algorithms to improve accuracy and the extraction of relevant information.
  • 13. Search Methods: As with eDiscovery, cDiscovery requires a search engine to process and present information to users, allowing them to search based on either the full text of the contracts or the proactively extracted contextual information. Because of the proactive extraction of information, cDiscovery solutions can present users with information without the users knowing what they are looking for. An example of this type of information management and presentation is the contracting party and contract type. Take the first example of “Sony Music”. This time, however, the user is presented with a view that groups all contracts by type, say Intellectual Property Sale and Sales Contracts. At this point the user only needs to select the view, or faceted search view, to see only the contracts of that type. Add to this the ability to search for a Contracting Party of Sony Music, and the system presents the user with accurate and targeted results, with the ability to view all the extracted information within a single view. Search and eDiscovery limitations: The main challenges faced by Search and eDiscovery methods today are the metadata and formatting lost when documents are converted to image-type files. Most, if not all, contracts that have been entered into are image files, with historical data almost always being faxed versions of the original signed contract. As previously detailed, the loss of formatting and metadata means the applications extract only streams of text, and then only if an OCR process has been used. Even with the OCR process, little or no error correction and information correlation is performed by the eDiscovery and Search engines, thus leaving errors within the extracted text.
  • 14. With the loss of the metadata and the induced errors, accurate discovery and classification of contracts becomes a significant challenge, and one that current Search and eDiscovery engines cannot meet. Summary: It should be clear that an effective contract discovery process needs additional functions and methods over and above what Search and eDiscovery offer. cDiscovery is a combination of Search, eDiscovery, complex document processing and targeted logic functions, for the proactive extraction and presentation of information within context. The Search and eDiscovery processes provide information reactively, relying on users’ knowledge and efforts to complete the processing. cDiscovery provides proactive presentation, as well as warnings on pending contractual obligations or milestones. cDiscovery should be seen as a logical extension to any eDiscovery process, as the information discovered and extracted can be utilised by the eDiscovery engines. Further to this, standard web services interfaces are provided within the Search, eDiscovery and cDiscovery applications. Processing of the correct information can therefore occur within the appropriate application, with information flowing seamlessly between each function. With new regulations and reporting rules, companies can no longer ignore contractual information within their environments. No Enterprise Search or eDiscovery engine is complete without the complement of cDiscovery processing. May 2011 COMMERCIAL IN CONFIDENCE © Copyright 2011. Seal Software Solutions Limited. All rights reserved. The contents of this document are commercial in confidence and are not to be copied or supplied in part or whole to third parties without the prior written consent of Seal Software Solutions Limited.