Demystifying analytics in e discovery white paper 06-30-14

2014 WHITE PAPER
DEMYSTIFYING ANALYTICS
IN EDISCOVERY
Steven Toole
Vice President of Marketing, Content Analyst Company, LLC.

© 2014 Content Analyst, LLC. All rights reserved. Content Analyst, CAAT and the Content Analyst and CAAT logos are registered trademarks of Content Analyst, LLC in the
United States. All other marks are the property of their respective owners.
A recent eDiscovery Journal blog entry by Greg
Buckles pointed out that few companies provide
the analytics technology in early case
assessment (ECA) and review platforms, and that
the specific analytics capabilities can vary wildly
from one platform to the next. While this may not
be breaking news to those closest to analytics in
eDiscovery, it did reveal some mystery about
how the market uses analytics in the early stages
of eDiscovery. In addition, it raised questions about
how analytics fit into the eDiscovery workflow and,
ultimately, about the return on investment that
analytics deliver for eDiscovery and information
governance in general.
This overview is designed to further dispel the
mystery surrounding analytics in eDiscovery and
information governance, and provide insights about
the return on investment analytics can enable for
those who embrace these capabilities. Corporate
counsel who get ahead of the curve today with
forward-thinking strategies such as these will be
the ultimate heroes and beneficiaries of eDiscovery
analytics, leading their field with a much more
proactive and cost-effective approach to information
governance and legal technology.
The Analytics Land Grab
While there are plenty of stakes in the ground across the
eDiscovery landscape, law firms and service providers are looking
“to the West” for unclaimed territory. The gigabyte gold rush is all
about applying analytics to the data further upstream in order
to “own” the data long before it’s needed in a matter. ECA was
the first frontier law firms and service providers looked toward in
order to move upstream, but the greenest pastures are still further
upstream, in what some call “pre-discovery.” The Golden Rule in
eDiscovery is simple: Those who “rule” the content get the gold.
Translation: the vendor that can apply analytics earliest – before it’s
needed in a matter – provides the most value to corporations, and
therefore is at a great competitive advantage.

Dynamic Clustering – This is a good place to start, especially if
you know little to nothing about the content. Clustering “buckets”
the content (documents, emails, etc.) into natural groupings of
conceptually related materials. One major benefit of clustering is
that it provides a very fast map of the document landscape in a
highly objective, consistent, concept-aware fashion. As a result,
the reviewer can jump straight to the cluster that’s of most interest
(conceptually relevant), and avoid spending time in clusters of no
conceptual relevance. In terms of information governance, it can
also help identify and weed out the ROT (redundant, outdated
and transient) content quickly and easily, thus reducing costs and
helping to increase ROI.
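
To make the "bucketing" idea concrete, here is a minimal sketch in Python. It assumes a plain bag-of-words comparison and a simple greedy grouping rule, which is only a stand-in: a real analytics engine would group by concept rather than by shared words. The function names, the stopword list, the threshold, and the sample documents are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, just enough for the sample documents below.
STOPWORDS = {"the", "for", "in", "a", "of"}

def vectorize(text):
    """Turn text into lowercase term counts, ignoring stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.4):
    """Greedy grouping: a document joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (leader_vector, [document indices])
    for i, text in enumerate(docs):
        vec = vectorize(text)
        for leader, members in clusters:
            if cosine(vec, leader) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]

# Hypothetical sample documents.
docs = [
    "pricing agreement for the supply contract",
    "lunch menu for the office party",
    "revised pricing terms in the supply contract",
    "office party lunch order",
]
groups = cluster(docs)
print(groups)
```

With these four samples, the two contract documents land in one bucket and the two office-party documents in another, so a reviewer could skip the second bucket entirely.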
Term Expansion – You have a keyword, name or technical term,
abbreviation, or acronym, and you want a list of all similar,
or highly related, terms so you can expand your search for
documents containing those terms as well. Term expansion
identifies conceptually related terms, customized to your content,
and ranked in order of relevance. For example, Barack Obama
might produce a list such as President Obama, Commander-in-
Chief, Senator Obama, Michelle Obama, the Oval Office, Office of
the President, POTUS, etc. In a matter, that means finding more
conceptually related content faster and easier, saving time and
money. In information governance, it helps identify content
related to corporate records, intellectual property, and compliance,
as well as, of course, more ROT for defensible deletion.
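
One simple way to picture term expansion is to count which terms keep showing up in the same documents as your seed term. The sketch below does only that. It is an illustration under that assumption, not a description of how any particular product ranks terms, and the sample documents, stopword list, and function name are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list for the sample documents below.
STOPWORDS = {"the", "in", "from", "on", "a", "of"}

def related_terms(seed, docs, top_n=3):
    """Rank terms by the number of documents they share with the seed.
    Ties are broken alphabetically so the result is repeatable."""
    counts = Counter()
    for text in docs:
        terms = set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS
        if seed in terms:
            counts.update(terms - {seed})
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]

# Hypothetical sample content.
docs = [
    "obama signed the bill in the oval office",
    "potus met senators; obama spoke from the oval office",
    "the senate passed the budget bill",
    "obama potus statement on the budget",
]
expanded = related_terms("obama", docs)
print(expanded)
```

Because the ranking is driven by your own content, the same seed term would expand differently against a different document set.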
Conceptual Search – You’ve identified a key document or
paragraph, now you want to find similar ones. Keyword search
will give you documents containing the specific keywords as best
as you can write the Boolean search string, and as long as those
keywords are included in the resulting documents. But writing
Boolean search strings can be time-consuming and still may miss
key documents containing the “unknown” terms not included in
your search string. To find the documents you’d otherwise miss
with keyword searches, you’ll need to use conceptual search.
Applying mathematical algorithms to your example document or
text selection, conceptual search looks for matching patterns in
the “map” of the data called the conceptual space. The benefit is
that conceptual search can find similar results even if the matching
document doesn’t contain any of the same terms as the example
text. Think abbreviations, misspellings, acronyms, code words,
and related terms you hadn’t heard. Then take away false positives
from synonyms and polysemes. Translation: uncover highly
relevant yet latent relationships, saving time and costs.
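
The sketch below shows how a match can happen with no shared keywords. It assumes a deliberately simple stand-in for the conceptual space: each term is described by the terms it shares documents with, and a document is the sum of its terms' descriptions. The text above does not specify the actual algorithm, so treat every name and the sample corpus here as hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stopword list for the sample corpus below.
STOPWORDS = {"the", "for", "a", "of", "in"}

def tokens(text):
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def build_term_vectors(docs):
    """Describe each term by the terms it shares documents with."""
    vectors = defaultdict(Counter)
    for text in docs:
        unique = set(tokens(text))
        for term in unique:
            vectors[term].update(unique)
    return vectors

def concept_vector(text, term_vectors):
    """Sum the co-occurrence vectors of every known term in the text."""
    total = Counter()
    for term in set(tokens(text)):
        total.update(term_vectors.get(term, {}))
    return total

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, docs):
    """Score every document against the query in the co-occurrence space."""
    term_vectors = build_term_vectors(docs)
    q = concept_vector(query, term_vectors)
    return [cosine(q, concept_vector(d, term_vectors)) for d in docs]

# Hypothetical sample corpus.
docs = [
    "president signed the energy bill",
    "potus signed the energy bill",
    "president vetoed the measure",
    "quarterly sales report for retail stores",
]
scores = search("potus", docs)
ranked = sorted(range(len(docs)), key=lambda i: -scores[i])
print(ranked, [round(s, 3) for s in scores])
```

A keyword search for "potus" would return only the second document. Here the third document, which never mentions "potus", still scores above zero because "president" and "potus" appear in similar contexts, while the unrelated sales report scores zero.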
Auto-Categorization – Predictive coding is one area that’s
gained a lot of attention in eDiscovery over the past two years.
Auto-categorization is what makes predictive coding possible.
Predictive coding is applying machine learning to a corpus of
documents to intelligently categorize them any number of ways,
such as privileged, responsive, or nonresponsive. For example,
users can categorize documents as responsive, then categorize
the responsive documents even further into relevant issue sub-
categories. Auto-categorization uses the same conceptual space
and sample document exemplars to find conceptually similar
documents and label them as appropriate. Again, the big benefit
here is a tremendous amount of time saving (and cost saving) by
letting the technology bring the most relevant documents to the
forefront, and into the hands of the domain experts, as quickly and
easily as possible.
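
As a rough illustration of labeling by example, the sketch below gives each new document the label of its most similar exemplar, or leaves it uncategorized when nothing is close enough. It assumes plain word-overlap similarity in place of a true conceptual space, and the exemplars, threshold, and function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list for the sample text below.
STOPWORDS = {"the", "with", "from", "and", "on", "for", "a", "of", "in"}

def vectorize(text):
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def categorize(text, exemplars, threshold=0.2):
    """Return the label of the most similar exemplar document, or
    'uncategorized' if no exemplar reaches the threshold."""
    vec = vectorize(text)
    best_label, best_score = "uncategorized", threshold
    for label, examples in exemplars.items():
        for example in examples:
            score = cosine(vec, vectorize(example))
            if score >= best_score:
                best_label, best_score = label, score
    return best_label

# Hypothetical exemplars, as a reviewer might have coded them.
exemplars = {
    "responsive": ["pricing agreement with the distributor"],
    "privileged": ["legal advice from outside counsel regarding the lawsuit"],
    "nonresponsive": ["company picnic and parking reminder"],
}
new_docs = [
    "distributor pricing discussion",
    "counsel advice on the lawsuit strategy",
    "parking reminder for the picnic",
    "weather forecast",
]
labels = [categorize(d, exemplars) for d in new_docs]
print(labels)
```

The same function could be run a second time on the responsive documents with issue-level exemplars to sort them into sub-categories.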
Email Threading – The concept of email threading is fairly
simple – find the subset of emails at the end of each branch of
a conversation thread. Rather than reading 30 emails back and
forth – as well as sideways among forwarded branches of the
conversation – email threading finds the subset of emails that
include all of the previous replies (called “inclusive” emails because
these emails, say six out of the 30, include the whole history). Time and cost
savings of using email threading are self-evident, but it’s also
important to note that threading reveals exactly who knew what,
when – pretty critical in piecing together the course of events that
unfolded surrounding a matter.
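
The idea can be sketched as finding the leaves of a reply tree. The example below assumes each email records which message it replied to or forwarded, and that every reply quotes the full history before it. Both are simplifying assumptions, and the message ids and function names are hypothetical.

```python
def inclusive_emails(parent_of):
    """Return the ids of emails that nobody replied to or forwarded.
    These sit at the end of each branch of the conversation."""
    parents = {p for p in parent_of.values() if p is not None}
    return sorted(e for e in parent_of if e not in parents)

def history(email_id, parent_of):
    """Walk from an email back to the start of the thread, returning
    the chain of messages it contains, oldest first."""
    chain = []
    while email_id is not None:
        chain.append(email_id)
        email_id = parent_of[email_id]
    return chain[::-1]

# Hypothetical thread: each message maps to the message it answers.
parent_of = {
    "m1": None,   # original message
    "m2": "m1",
    "m3": "m2",   # end of the reply branch
    "m4": "m2",   # forwarded branch
    "m5": "m4",   # end of the forwarded branch
}
leaves = inclusive_emails(parent_of)
print(leaves)
print(history("m5", parent_of))
```

In this five-message thread, reading only the two emails at the ends of the branches covers the whole conversation, and walking each one's history shows who had seen what at each step.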
Near-Duplicate Identification – A similar benefit to email threading
exists with near-duplicate identification. While conceptual search,
clustering or categorization can identify documents that are
relevant to the case, many could be various versions of the same
document. Knowing that they’re near duplicates of each other
can save the time of having to review each one. If it’s important to
know what changed from one version to the next, when, and by
whom, difference highlighting shows these changes, again saving
time and reducing cost. Batching near duplicates together from
the outset of a matter also provides reviewers with a more focused
set of documents.
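
A common way to approximate near-duplicate detection is to compare overlapping word sequences ("shingles") between two documents, and Python's standard `difflib` module can then show what changed. The sketch below assumes that approach purely for illustration. The sample sentences and function names are hypothetical.

```python
import difflib

def shingles(text, k=3):
    """Set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Share of shingles two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def changed_words(old, new):
    """Difference highlighting: words removed from old, words added in new."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    removed, added = [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            removed += old_words[i1:i2]
        if op in ("replace", "insert"):
            added += new_words[j1:j2]
    return removed, added

# Hypothetical document versions.
v1 = "the supplier will deliver the goods by march 1"
v2 = "the supplier will deliver the goods by april 1"
other = "please join us for the holiday party on friday"

print(round(jaccard(v1, v2), 3), jaccard(v1, other))
print(changed_words(v1, v2))
```

The two contract versions share most of their shingles and would be batched together, while the unrelated invitation shares none. The difference check then points the reviewer straight at the one word that changed.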
Recipe for Success
This all sounds good, but what does this really mean? You can’t bake a cake until you know what the ingredients are, what they do, and how
they can affect the output. And since each eDiscovery solution has a different set of analytics capabilities, here’s a brief tutorial on the key
ingredients of analytics for eDiscovery and information governance.
Putting Your Data on a Diet
Putting these analytics capabilities to work for you may cause
serious weight loss in your data. In eDiscovery, that means fewer
documents to review by expensive reviewers. It also means that
the documents they are reviewing are nothing but the absolutely
most conceptually relevant documents to the case. Moreover,
reviewers are being presented with documents that are not
batched haphazardly, allowing for a more focused review, driving
accuracy to an all-time high and costs and time even lower.
But Wait – There’s More!
Remember the gigabyte gold rush from above? The Golden Rule
of data? This is where analytics really are the key to unlocking all
those hidden insights in a company’s data – long before they’re
needed in eDiscovery. Applying text analytics to a company’s
electronic records proactively through pre-discovery means that
data is already organized, reduced, and ready to be presented
if and when a matter arises. Corporate counsel love this idea
because it keeps litigation costs as low as possible, decreases the
crucial time it takes to investigate or decide whether to settle a
case, and helps them present their side in the very best possible
light.
The Bottom Line: ROI for eDiscovery Analytics
Measuring the ROI of using analytics in eDiscovery comes down
to this: Review is the greatest cost factor in eDiscovery. Expert
reviewers don’t come cheap – their expertise is clearly of utmost
value in a case. But if a large percentage of their time is spent
reviewing documents not relevant to a case, then you’re not
getting the most value out of them in the first place. Their hours
cost the same whether they’re looking at the smoking gun
document in a case or something completely unrelated. You
wouldn’t wear gloves during a palm reading – they’d just get
in the way of the psychic doing her job. Presenting a corpus of
documents to expensive reviewers without applying machine
learning first makes no more sense. Further, missing documents
that analytics would have surfaced can also hinder the experts’
ability to formulate your case strategy.
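
The arithmetic behind this argument is simple enough to sketch. Every figure below (document counts, review speed, hourly rate, analytics cost) is a hypothetical placeholder chosen only to show the calculation, not data from any real matter.

```python
def review_cost(num_docs, docs_per_hour, hourly_rate):
    """Reviewer hours cost the same whatever the document turns out to be."""
    return num_docs / docs_per_hour * hourly_rate

# Hypothetical inputs for illustration only.
baseline = review_cost(100_000, 50, 200.0)  # review everything
culled = review_cost(40_000, 50, 200.0)     # review after analytics culling
analytics_cost = 60_000.0                   # assumed cost of applying analytics

savings = baseline - culled
roi = (savings - analytics_cost) / analytics_cost
print(baseline, culled, savings, roi)
```

Plug in your own volumes and rates. The point is that every document removed before review reduces cost at the full reviewer rate.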
But the ROI of applying analytics to your documents and email
pre-discovery goes even further. The cost benefits of organizing
and analyzing your content proactively are huge, helping to
drive decision making and information governance practices for
compliance, risk mitigation and cost avoidance. Applying these
strategies early can provide a tremendous advantage long term
through a much more proactive and cost-effective approach to
information governance and eDiscovery.
Steven Toole is vice
president of marketing
at Content Analyst
Company where his unique
combination of business
acumen, creativity,
strategic vision and tactical
execution yields impactful
results toward the
company's mission.
