2012 11 7 TAR Webinar Part 3 Sigler

Demystifying Technology Assisted Review

Part 3: Deconstructing the Technology

Sonya L. Sigler

Agenda

  Review/Overview
  Underlying Search Technology
    dtSearch
    Lucene (open source)
    Others – MySQL, etc.
  Underlying Statistical Based Technology
    Rules Based Technology (Linguistic or Statistical)
    Bayesian Probabilistic Technologies
    Latent Semantic Indexing
  Q & A

Review/Overview - Search & Review Spectrum

[Diagram; labels in order of appearance]

  Linear Review
  Culling
  Iterative search
  Review
  Accelerated Review
  Email Threading
  Near Duplicate Detection
  Automated Review
  Per CA - Clustering
  Relevance Ranking
  Document Categorization (Supervised) Machine Learning
  Latent Semantic Indexing (statistical probability)
  Pattern Analysis
  Sampling Data for High Precision and Recall Rates

Other labels on the diagram: Cost, Organization Commitment

Underlying Technologies

[Diagram; labels in order of appearance]

  Rules Based Systems
  dtSearch
  Keyword Search
  Ontologies
  Lucene
  Other Search Engines
  Linguistic – word based
  Statistical - numbers based
  Bayesian Classification
  Support Vector Models
  Latent Semantic Indexing

Database Normalization

From: Nuala Coogan Nuala@SFLData.com
Subject: EDI Summit – Florida
Date: October 3, 2012 10:11:21 AM PDT
To: Sigler L. Sonya Sonya@sigler.name

From: Nuala Coogan
Subject: EDI Summit – Florida
Date: 10/03/12
To: Sonya Sigler
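
The two header blocks above show raw and normalized values for the same email. A minimal Python sketch of that kind of cleanup follows. The function names, the regular expression, and the assumed date layout are illustrative assumptions, not the method of any particular review platform, and reordering "Sigler L. Sonya" into "Sonya Sigler" is not attempted here.

```python
import re
from datetime import datetime

# Hypothetical pattern: an email address, optionally wrapped in angle brackets.
ADDRESS = re.compile(r"\s*<?[\w.+-]+@[\w.-]+>?")


def strip_address(value):
    """Remove the email address from a From/To value, keeping the display name."""
    return ADDRESS.sub("", value).strip()


def normalize_date(value):
    """Convert 'October 3, 2012 10:11:21 AM PDT' style dates to MM/DD/YY."""
    parts = value.split()
    # Drop a trailing time zone abbreviation such as PDT (assumed layout).
    if parts and parts[-1].isalpha() and parts[-1].isupper() and parts[-1] not in ("AM", "PM"):
        parts = parts[:-1]
    parsed = datetime.strptime(" ".join(parts), "%B %d, %Y %I:%M:%S %p")
    return parsed.strftime("%m/%d/%y")


print(strip_address("Nuala Coogan Nuala@SFLData.com"))
print(normalize_date("October 3, 2012 10:11:21 AM PDT"))
```

Normalizing fields this way lets the same sender or date be matched consistently across documents that stored them in different formats.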

Tokenization

  Words, Phrases, Symbols
  Mostly at the word level
  Numbers
  Punctuation
  Meaningful Elements or Pieces -> Tokens
  Parsing and Text Mining
  Treatment of Contractions, Hyphenated words, Emoticons and Larger Constructs (like URLs)
  Look-up tables
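
As a rough illustration of the choices listed above, the following Python sketch tokenizes text with one regular expression. The pattern and the name `tokenize` are assumptions made for this example. Real search engines each make their own decisions about contractions, hyphens, emoticons and URLs.

```python
import re

# Alternatives are tried left to right, so larger constructs win over words.
TOKEN = re.compile(r"""
    https?://\S+            # URLs are kept as a single token
  | [:;]-?[()DP]            # a few simple emoticons such as :-) or ;)
  | \w+(?:[-']\w+)*         # words, including hyphenated words and contractions
  | [^\w\s]                 # any other punctuation mark, one character at a time
""", re.VERBOSE)


def tokenize(text):
    """Split text into tokens; whitespace between tokens is discarded."""
    return TOKEN.findall(text)


print(tokenize("Don't e-mail me :-) see http://example.com/a now."))
```

Changing a single alternative (for example, splitting on hyphens) changes which search terms can later match, which is why tokenization rules matter for review.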

Linguistic Based Technologies

Keyword Sample

Simple:
"legal systems" OR legalsystems
"Mike Custodian"

Medium:
mail(custodian@domain.com) AND "legal systems"
(Custodian w/3 (Mike OR Michael OR M))

Complex:
(privilege OR privileged OR legally OR "work product") NOT w/35 (((original OR intended OR designated OR named) w/3 (recipient OR recipients OR addressee OR addressees OR solely)) OR ("message in error") OR ("received in error") OR ("named above") OR ((electronic or email or e-mail) w/3 (message or transmission)) OR ("confidentiality notice"))

Ontology Sample

q ((+(std:%CapacityReports_% std:%DINCapacity_%) (std:%ACMEEPPlant_% std:%ProductName_%))
(+(std:%ACMEPNPlant_% std:%ProductName_%) +(std:%ProductiveCapability_% std:%CapacityReports_%))
(+(std:%CapacityCreep_% std:%OperationsImprovement_% std:%CapacityExpansion_% std:%CapacityRestoration_%) +(std:%ACMEPNPlant_% std:%ProductName_%))
(+(std:%EquipmentReplacement_% std:%FinishingColumn_%) +(std:%ACMEPNPlant_% std:%ProductName_%))
(std:%Audit_% actor:%Audit_%)
(+(std:%SettlementNegotiations_% std:%ContractNegotiations_% ) +(actor:%ACMEOutsideCounsel_% std:%ACMEOutsideCounsel_% actor:%ACMEUBOutsideCounsel_% std:%AcmeSubOutsideCounsel_% actor:%AcmeSub_% std:%AcmeSub_%))
(std:%FTC_% actor:%FTC_%)
((+subject:%ProductName_% +(std:swap std:"supply agreement" std:"exchange agreement" std:"agree to exchange")) std:"name

Search Engines

  dtSearch
    dtSearch Corp., founded 1991
    Incorporated into Symantec's Norton Navigator
    SDKs available, most license off the shelf
    http://support.dtsearch.com/faq/search.html
  Lucene
    Open source - http://lucene.apache.org/core/
    Doug Cutting, 1999; part of Apache projects in 2001
    APIs, Customizable
  Other – MySQL, SQL (DBMS, RDBMS)

dtSearch

  Relativity, Concordance, Viewpoint, others
  Single User desktop license $199
  Little Customization – more similarities across apps
  Includes Boolean operators
  Includes Proximity searching
  Includes Fuzzy Searching
    Alphabet -> Alphaqet, alpphabet, alpkaqet

Lucene

  Clearwell, Intella, Cataphora, SHIFT, others
  Open Source Tool – meant to be customized
  Little Similarities Across Apps
    Know your defaults!
  Includes Boolean Operators
  Includes Proximity Searching

dtSearch – Fuzzy Searching

  Degrees of Fuzziness
    1-10; dtSearch uses 1-3
  Marked by use of % symbol
    Insertion: co%t → coat
    Deletion: coat → co%t
    Substitution: coat → cost
    Transposition: cots → cost
  Fuzziness Degrees
    Alphabet – Alphaqet, Alpkaqet
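
The four operations above (insertion, deletion, substitution, transposition) are the ones counted by the classic edit distance with adjacent transpositions. The Python sketch below implements that textbook measure (optimal string alignment) as an illustration only. The slides do not say how dtSearch computes its fuzziness levels, so the function name `edit_distance` and the mapping from "degrees of fuzziness" to a distance are assumptions.

```python
def edit_distance(a, b):
    """Count the insertions, deletions, substitutions and adjacent
    transpositions needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "ts" <-> "st".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for pair in [("coat", "cost"), ("cots", "cost"), ("alphabet", "alphaqet"), ("alphabet", "alpkaqet")]:
    print(pair, edit_distance(*pair))
```

Under this measure "alphaqet" is one edit away from "alphabet" and "alpkaqet" is two, which mirrors the idea that a higher fuzziness setting admits more distant variants.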

Boolean Operators – AND, OR, NOT

dtSearch

  Search for
    Multiple words treated as a phrase
    ANY – treats word list as separated by OR
    ALL – treats word list as separated by AND

Lucene

  Depends on customization
    OR
    AND
  Know your defaults
  Spell out variations
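
To see why the default matters, the Python sketch below evaluates the same two-word query under the three interpretations named above: as a phrase, as ANY (OR), and as ALL (AND). The function `matches` and its `mode` argument are hypothetical names for this illustration and do not correspond to dtSearch or Lucene APIs.

```python
def matches(document, query, mode):
    """Return True if the document satisfies the query under the given mode.

    mode is "phrase", "any" (words joined by OR) or "all" (words joined by AND).
    """
    doc_tokens = document.lower().split()
    words = query.lower().split()
    if mode == "phrase":
        n = len(words)
        return any(doc_tokens[i:i + n] == words for i in range(len(doc_tokens) - n + 1))
    if mode == "any":
        return any(word in doc_tokens for word in words)
    if mode == "all":
        return all(word in doc_tokens for word in words)
    raise ValueError("unknown mode: " + mode)


doc = "the legal department reviewed the systems"
for mode in ("phrase", "any", "all"):
    print(mode, matches(doc, "legal systems", mode))
```

The same query text returns different document sets depending on the mode, so a search protocol should state which default the platform applies.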

Proximity

dtSearch

  Pre/post
  w/ order doesn't matter
    House white
    White house
  Pre/ finds first word prior to second word
    White house

Lucene

  w/ order doesn't matter
  No pre usage
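
The difference between an unordered "within" search and an ordered "pre" search can be sketched in a few lines of Python. The functions `within` and `pre` below are illustrative assumptions. How each engine counts the distance between words may differ from this simple token-position arithmetic.

```python
def positions(tokens, word):
    """Return every index at which word occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == word]


def within(text, first, second, distance):
    """True if the two words occur within `distance` tokens, in either order."""
    tokens = text.lower().split()
    return any(
        abs(i - j) <= distance
        for i in positions(tokens, first)
        for j in positions(tokens, second)
    )


def pre(text, first, second, distance):
    """True only if `first` occurs before `second`, within `distance` tokens."""
    tokens = text.lower().split()
    return any(
        0 < j - i <= distance
        for i in positions(tokens, first)
        for j in positions(tokens, second)
    )


for text in ("white house", "house white"):
    print(text, within(text, "white", "house", 1), pre(text, "white", "house", 1))
```

Both "white house" and "house white" satisfy the unordered search, while only "white house" satisfies the ordered one, matching the examples on the slide.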

Punctuation

dtSearch

  Letters
  Space treated as a word break
  Hyphens
  % - fuzzy searching
  _ - ignored

Lucene

  All punctuation
    Ignored

dtSearch Hyphen Example

Noise Words – Unindexed, Ignored

dtSearch

  Unindexed, Can create Custom Index
  Many, but a few examples: Do, not, for, your, only, under, made, way
  Know defaults

Lucene

  Ignores * in quotes
  (Quality Control*) = Quality Control but nothing else
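
The effect of unindexed noise words can be shown with a tiny inverted index in Python. The noise-word set below reuses the examples from the slide. The functions `build_index` and `search` are hypothetical names for this sketch, not part of any search engine's API.

```python
from collections import defaultdict

# Example noise words taken from the slide; real default lists are much longer.
NOISE_WORDS = {"do", "not", "for", "your", "only", "under", "made", "way"}


def build_index(docs, noise_words=NOISE_WORDS):
    """Map each non-noise word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token not in noise_words:
                index[token].add(doc_id)
    return index


def search(index, word):
    """Return the ids of documents containing the word (empty set if unindexed)."""
    return index.get(word.lower(), set())


docs = {1: "Do not delete your files", 2: "delete only under review"}
default_index = build_index(docs)
custom_index = build_index(docs, noise_words=set())
print(search(default_index, "not"), search(custom_index, "not"))
```

With the default list, a search for "not" finds nothing even though document 1 contains it. A custom index built without noise words makes the term searchable again.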

Stemming v. Wild Cards

Stemming

  Syntactic Variations
  Regular Verbs
  Irregular Verbs
  dtSearch performs poorly with irregular verbs
  Spelling out recommended

Wild Cards

  Strings of characters
  Replacements for beginning, parts, or endings
  Lucene - *
  dtSearch - ? for single character, * for any # of characters
  Time consuming
  Wild cards in quotes

Stemming v. Wild Cards Example

Stemming

  Catch – Lucene
  Catch~ - dtSearch
    Catch
    Catches
    Catching
    Catcher
    Caught - not in dtSearch

Wild Cards

  Catch*
    Catch
    Catches
    Catching
    Catcher
    Catch1234 – not in stemming
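
The contrast in the example above can be reproduced with a short Python sketch. It uses the standard library's `fnmatch` for the wildcard side and a deliberately naive suffix stripper for the stemming side. `toy_stem` and its suffix list are assumptions for illustration and are far simpler than the stemmers shipped with dtSearch or Lucene.

```python
from fnmatch import fnmatchcase

WORDS = ["catch", "catches", "catching", "catcher", "catch1234", "caught"]
SUFFIXES = ("ing", "es", "er", "s")  # toy list, checked in this order


def toy_stem(word):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def wildcard_hits(pattern, words):
    """Words matching a shell-style pattern such as 'catch*'."""
    return [w for w in words if fnmatchcase(w, pattern)]


def stemming_hits(term, words):
    """Words whose toy stem equals the toy stem of the search term."""
    target = toy_stem(term)
    return [w for w in words if toy_stem(w) == target]


print(wildcard_hits("catch*", WORDS))
print(stemming_hits("catch", WORDS))
```

The wildcard picks up "catch1234" because it only compares characters, while suffix stripping does not. Neither approach finds the irregular form "caught", which is why spelling out variations is recommended.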

Statistical Technologies

  Rules Based
  Bayesian Classification
  Vector Space Modeling
  Latent Semantic Indexing

Statistical Based Technologies

  Concept - Clustering
    Machine
    Unsupervised
    Quickly understand data
    Uncontrolled Clusters

Statistical Based Technologies

  Concept - Categorization
    User Created
    Supervised
    Control Topics
    Time Consuming

Statistical Based Technologies

Rules Based Systems

  If.. Then…
    If email = person 1 to person 2 then return it
    If email = person 1 or person 2 then return it
  Artificial Intelligence Systems
  Entity extraction (& dictionaries)
  Time consuming
  Mirror human thinking
  Case, subject matter
  Transparent System
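
The two If..Then rules above can be written out as small predicates. The Python sketch below is one possible reading of them. The email record layout (`from`, `to`) and the function names are assumptions made for this example.

```python
def rule_one_to_two(email, person1, person2):
    """If the email is from person 1 to person 2, return it."""
    return email["from"] == person1 and person2 in email["to"]


def rule_either(email, person1, person2):
    """If the email involves person 1 or person 2, return it."""
    participants = {email["from"], *email["to"]}
    return person1 in participants or person2 in participants


def apply_rule(rule, emails, person1, person2):
    """Return the emails for which the rule fires."""
    return [email for email in emails if rule(email, person1, person2)]


emails = [
    {"id": 1, "from": "ann", "to": ["bob"]},
    {"id": 2, "from": "bob", "to": ["ann"]},
    {"id": 3, "from": "ann", "to": ["cara"]},
    {"id": 4, "from": "cara", "to": ["dan"]},
]
print([e["id"] for e in apply_rule(rule_one_to_two, emails, "ann", "bob")])
print([e["id"] for e in apply_rule(rule_either, emails, "ann", "bob")])
```

Because every decision traces back to an explicit rule, a system built this way is transparent, at the cost of the time needed to write and maintain the rules.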

Statistical Based Technologies

  Bayesian Classification
    Probabilistic
    Co-occurrence
    Frequency
    Spam Filters
      Viagra
    Concepts
      Words, phrases

Bayesian

  Bayesian illustration
    Baseball, glove, diamond, bats, hit, home run
    Diamond, pendant, jewelry
  Co-occurrence
    Local – within a document
    Global – across document population
  Frequency – how often does it appear
  Weighting – uniqueness counts
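
The "diamond" illustration above can be turned into a toy naive Bayes classifier: the word alone is ambiguous, and the words that co-occur with it tip the probability toward one concept. The sketch below uses the slide's word lists as training data. The class labels, the add-one smoothing, and the equal prior for both classes are assumptions of this example, not a description of any commercial product.

```python
import math
from collections import Counter

# Training words per concept, taken from the slide ("home run" split in two).
TRAINING = {
    "baseball": ["baseball", "glove", "diamond", "bats", "hit", "home", "run"],
    "jewelry": ["diamond", "pendant", "jewelry"],
}
VOCAB = {word for words in TRAINING.values() for word in words}


def log_score(label, tokens):
    """Log prior plus log likelihood of the tokens under one concept."""
    counts = Counter(TRAINING[label])
    total = len(TRAINING[label])
    score = math.log(1 / len(TRAINING))  # equal prior for each concept
    for token in tokens:
        # Add-one (Laplace) smoothing so unseen words do not zero the score.
        score += math.log((counts[token] + 1) / (total + len(VOCAB)))
    return score


def classify(tokens):
    """Return the concept with the highest score for the given tokens."""
    return max(TRAINING, key=lambda label: log_score(label, tokens))


print(classify(["diamond", "glove"]))
print(classify(["diamond", "pendant"]))
```

Spam filters work on the same principle, scoring a message by how strongly its words are associated with previously labeled spam.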

Statistical Based Technologies

  Vector Space Modeling
  Latent Semantic Indexing/Analysis
    Words
    Phrases, Concepts
    Tables
    Algebraic equations representing docs
    Weighting Algorithms
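
In the standard textbook formulation of Latent Semantic Indexing (the notation below is not from the slides), the table of words by documents is a matrix $A$, which is approximated by a rank-$k$ truncated singular value decomposition. Documents are then compared by the cosine of the angle between their reduced vectors $\hat{d}_i$, taken as the rows of $V_k \Sigma_k$:

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad
\operatorname{sim}(d_i, d_j) =
  \frac{\hat{d}_i \cdot \hat{d}_j}{\lVert \hat{d}_i \rVert \, \lVert \hat{d}_j \rVert}
```

Keeping only the $k$ largest singular values is what lets documents that share few literal words still land close together when they use related vocabulary.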

Latent Semantic Indexing Example

  Exclude Noise Words
    The, and, or, etc.
  Vector Space Modeling
  Build Document Profile
    Diamond
    Base, ball
    Necklace, pendant
    Diamond Saw

Mathematical Foundation

  Tables built with 0s, 1s
    Yes it has that word or phrase
    No it doesn't
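
A table of 0s and 1s like the one described above can be built in a few lines of Python. The three sample documents, the noise-word set, and the function names are made up for this sketch, loosely following the "diamond" examples used earlier in the deck.

```python
import math

NOISE_WORDS = {"the", "and", "or", "a"}
DOCS = {
    "doc1": "the diamond and the base ball glove",
    "doc2": "a diamond necklace or pendant",
    "doc3": "the diamond saw",
}


def build_matrix(docs):
    """Return the vocabulary and one 0/1 row per document."""
    vocab = sorted({w for text in docs.values() for w in text.split() if w not in NOISE_WORDS})
    rows = {
        name: [1 if term in text.split() else 0 for term in vocab]
        for name, text in docs.items()
    }
    return vocab, rows


def cosine(u, v):
    """Cosine similarity between two document rows."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


vocab, rows = build_matrix(DOCS)
print(vocab)
for name, row in rows.items():
    print(name, row)
```

Each row is the document's profile: a 1 means "yes, it has that word" and a 0 means "no, it doesn't". Comparing rows with `cosine` gives a first, unweighted measure of how similar two documents are.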

Simple Matrix with Weighting
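
The slides do not name the weighting scheme used in the matrix. One common choice that matches the earlier remark that "uniqueness counts" is TF-IDF, sketched below in Python as an assumption. The sample documents and the function name `tf_idf` are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term by its count in the document times log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    return {
        name: {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in Counter(tokens).items()
        }
        for name, tokens in docs.items()
    }


docs = {
    "doc1": ["diamond", "base", "ball", "glove"],
    "doc2": ["diamond", "necklace", "pendant"],
    "doc3": ["diamond", "saw"],
}
weights = tf_idf(docs)
for name, row in weights.items():
    print(name, {term: round(value, 2) for term, value in row.items()})
```

Here "diamond" appears in every document and so receives a weight of zero, while a term found in only one document, such as "saw", receives the highest weight.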

Weighted by Document (not just type)

Defensibility Report

  Document, Document, Document
  Transparency
  Workflow
  What Was Considered, By Whom?
  QC Process
  Metrics

Q&A - Thank you!

Post your questions to the presenter in the chat section

Sonya L. Sigler
Vice President, Product Strategy & Consulting
SFL Data
415-321-8385
sonya@sfldata.com
www.sfldata.com
