2. Overview
Triggers & Preservation
• What is it?
• Why Does it Matter?
Search
Keyword Search
Clustering
Ontologies
Technology Enhanced Review - Sampling
Social Networking Analysis
Relationship Analysis
9/23/2014 2
3. “Triggers” & Preservation
What is a Trigger?
– Litigation reasonably anticipated
– Who decides
Litigation Hold Continuum
– Established in hindsight
– Threat
– Letter about litigation
– Filing Suit
Cases
– Pippins, Zubulake, Pension Committee
4. Pippins v. KPMG
How much data to Preserve?
– All hard drives (Pippins’ position)
– 100 Sample Hard drives (KPMG’s position)
To Cooperate or NOT to Cooperate?
How Judges React to Lack of Cooperation
5. Zubulake
Litigation Holds
– Cannot send a request into the ether
Preservation
Have to follow-up
Take affirmative steps to monitor compliance
In-house Counsel Duty
Cannot leave it to employees’ discretion
Document what was done
6. Pension Committee
No intentional destruction of data
Careless & indifferent
No Latchkey Custodians (alone & unsupervised)
– Identify Custodians
– Monitor their efforts
– Including former employees and third parties
Proactive
Consistent
Reasonable Approach
8. What To Do?
Who to include?
– Not about data volume
– Not about contact with underlying “litigation”
Key Players (Zubulake opinions)
– Likely to have relevant information
– CEO, Board, Committees, employees, etc.
Produce it from the Key Player (not others)
– Nursing Home Pension Fund v. Oracle
– Produce emails from the CEO (15) not others (1,650)
9. Spoliation
Failure to Preserve
– Didn’t Ask
• Right person
• Right Place
– Didn’t follow up
Destruction of Data
– Intentional
– Inadvertent destruction
What can happen
– Sanctions
– Adverse Inferences
10. Search
How to Use it To Find Information
How to Use it to Ignore Information
When to use which search methodology
11. Search - Data Assessment
Where is the Data?
– Data Mapping - databases, servers, desktops, laptops, IMs, smart phones, voicemail, other records
Defining Process from Collection to Review to Production
Collection Strategy, Process, Approach
– Scope of collection: custodians, date ranges, topics
Reports on the Data Processing
– File types, encrypted files, de-duplication rates, password protected files, etc.
Not Reasonably Accessible data
Assessing Risk of Data Loss
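One of the processing-report figures listed on this slide, the de-duplication rate, can be sketched as follows. This is a minimal illustration that assumes exact duplicates are found by hashing file contents with SHA-256. The function name `dedup_rate` and the sample data are hypothetical, and real processing tools may hash differently or also compare metadata.

```python
import hashlib


def dedup_rate(documents):
    """Return the share of documents that duplicate an earlier document.

    `documents` is a list of raw file contents (bytes). Hypothetical helper:
    exact duplicates are detected by comparing SHA-256 digests.
    """
    if not documents:
        return 0.0
    unique_digests = {hashlib.sha256(content).hexdigest() for content in documents}
    return 1 - len(unique_digests) / len(documents)


# Made-up sample: four files, two of which are byte-for-byte identical.
sample = [b"Q3 forecast", b"Q3 forecast", b"board minutes", b"lunch menu"]
rate = dedup_rate(sample)
print(f"de-duplication rate: {rate:.0%}")
```

Reporting this rate per custodian or per source is one way to spot collections that were copied more than once.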
12. Search - Case Assessment
Who - Cast of Characters
What - What the Heck Happened?
Where - Where did it take place?
When - What time period are we concerned with?
How - fraud, antitrust violation, etc.
WHY - What were the motives involved?
Data Assessment ≠ Effective Case Assessment
13. Keyword Search Under Scrutiny
United States v. O’Keefe (Facciola)
– Questioned lawyers’ ability to decide which search terms are more likely to
produce relevant information
– Facciola has also suggested that litigants take a look at advanced search
methodologies
Victor Stanley, Inc. v. Creative Pipe, Inc. (Grimm)
– Defensibility of process AND execution lies with the party relying upon the search protocol to meet its obligations; that party must be able to explain the search rationale, its appropriateness, and its proper implementation
– Advocates quality assurance, e.g. by sampling
– Searches should be designed by a competent practitioner
14. Keyword Specific Case
William A. Gross Construction Associates, Inc. v.
American Manufacturers Mutual Insurance Company
SDNY, Judge Andrew Peck
Keyword list was in the thousands
Use the actual data set and custodians to figure out
keywords
“This case is just the latest example of lawyers designing keyword
searches in the dark, by the seat of the pants, without adequate
(indeed, here, apparently without any) discussion with those who wrote
the emails. Prior decisions from Magistrate Judges in the Baltimore-
Washington Beltway have warned counsel of this problem, but the
message has not gotten through to the Bar in this District.”
15. $6M Keyword Mistake
In re Fannie Mae Securities Litigation
3rd Party - OFHEO
DC Circuit - Judge David Tatel
Attorney agreed to something he did NOT understand
Long list of key terms
Taxpayers suffered the consequence
16. What This Means
• The Courts are finally
catching up
• Courts actively ruling on
Standards of Care and
Process
• Lawyers are Getting Wise
17. Case Law Effects on Discovery
Defensibility of Review Process is now a focus
– Culling now can kill you later
– Cooperation is a hot topic
– Tussle between inside & outside counsel
– Beginning to see planning as a necessity
Increased focus on Quality
– Heightened involvement expected from corporate clients
in the overall process
– Cases pushing this: Qualcomm, Creative Pipe
18. What Else Is There?
Effort to establish & codify uniform “Best Practices”
– Quickly becoming roadmap for uneducated industry
– Increasingly relied upon by judges as measure of reasonable or
standard behavior
Publications have addressed:
– Document retention & production
– Email management
– Search & Retrieval
– Protective orders & confidentiality
– ESI admissibility
19. Getting to a Manageable Review Set
Focus on finding, reviewing & using the “right” data, not just filtering data
Intake Data: 100%
– Duplicates: 25%
– Non-Responsive: 20%
– Junk/Spam/Porn: 20%
– NR/Priv: 20%
– Responsive & Priv: 15%
Produced: 12.25%
These figures vary based upon the data set received
20. Search Methodologies
Visualization, Measurement
Context
– Relationship Analysis: documents with causal or sequential relationship
– Social Network Analysis: relationships among relevant people
Concept
– Clustering: similarity of salient features
– Ontology: generalized words or phrases
Content
– Keyword: specific exact words, proximity searches, stemming
21. Keyword Accuracy Example
Keyword search reduced the document set by only 47%
88% of the documents returned by keyword search were not responsive (over-inclusive)
8,553 responsive documents were missed by keyword search, almost 8% of all responsive documents (under-inclusive)
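The over-inclusive and under-inclusive figures on this slide correspond to what information retrieval calls precision and recall. The sketch below shows the arithmetic only. The counts are hypothetical, because the slide does not give the underlying totals, and the function name `precision_recall` is an assumption for illustration.

```python
def precision_recall(returned_responsive, returned_total, responsive_total):
    """Return (precision, recall) for a search.

    precision: share of returned documents that are responsive
    recall:    share of all responsive documents that the search returned
    """
    precision = returned_responsive / returned_total
    recall = returned_responsive / responsive_total
    return precision, recall


# Hypothetical counts chosen only to illustrate the arithmetic.
p, r = precision_recall(returned_responsive=120, returned_total=1000,
                        responsive_total=130)
print(f"precision = {p:.1%}, so {1 - p:.1%} of returned documents are over-inclusive")
print(f"recall    = {r:.1%}, so {1 - r:.1%} of responsive documents were missed")
```

A search can score well on one measure and poorly on the other, which is why both numbers are worth reporting.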
22. Myth
Keyword Searching is the Way to Go
If I agree to keyword terms, I am OK
Keyword Search Cases
Keyword replacement example
Keyword substitution
Missing in Action (Under-inclusive)
Unwanted Extras (Over-inclusive)
Multiple subject/persons (Disambiguate)
23. Fact or Myth?
Manual review by humans of large amounts of information
is as accurate and complete as possible - perhaps even
perfect - and constitutes the gold standard by which all
searches should be measured
This is “The reigning Myth of ‘perfect’ retrieval using traditional means”
Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery
The Sedona Conference Journal (2007) p. 199
Human beings retrieved less than 20% of the relevant documents when they believed they were retrieving over 75%
An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System
Blair & Maron (1985)
24. IS 240 – Spring 2011
Blair and Maron 1985
A classic study of retrieval effectiveness
– earlier studies were on unrealistically small collections
Studied an archive of documents for a legal suit
– ~350,000 pages of text
– 40 queries
– focus on high recall
– Used IBM’s STAIRS full-text system
Main Result:
– The system retrieved less than 20% of the relevant
documents for a particular information need; lawyers
thought they had 75%
But many queries had very high precision
25. Blair and Maron, cont.
How they estimated recall
– generated partially random samples of unseen documents
– had users (unaware these were random) judge them for
relevance
Other results:
– two lawyers’ searches had similar performance
– lawyers’ recall was not much different from paralegals’
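The sampling idea on this slide can be sketched in a few lines: judge a random sample of the documents the search did not retrieve, project the relevant share onto the whole unretrieved set, and compute recall from that. This is a simplified illustration, not the study's exact procedure. The function name and all counts are hypothetical.

```python
def estimate_recall(retrieved_relevant, unretrieved_total,
                    sample_size, sample_relevant):
    """Estimate recall from a judged random sample of unretrieved documents."""
    # Share of the sample that the judges marked relevant.
    relevant_rate = sample_relevant / sample_size
    # Project that share onto everything the search did not retrieve.
    estimated_missed = relevant_rate * unretrieved_total
    return retrieved_relevant / (retrieved_relevant + estimated_missed)


# Hypothetical counts: 200 relevant documents were retrieved; a sample of
# 1,000 from the 40,000 unretrieved documents contained 20 relevant ones.
recall = estimate_recall(retrieved_relevant=200, unretrieved_total=40_000,
                         sample_size=1_000, sample_relevant=20)
print(f"estimated recall: {recall:.0%}")
```

With these made-up numbers the searchers found 200 of an estimated 1,000 relevant documents, even though every retrieved document looked like a hit.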
26. Blair and Maron, cont.
Why recall was low
– users can’t foresee exact words and phrases that will
indicate relevant documents
• “accident” referred to by those responsible as:
“event,” “incident,” “situation,” “problem,” …
• differing technical terminology
• slang, misspellings
– Perhaps the value of higher recall decreases as the
number of relevant documents grows, so more detailed
queries were not attempted once the users were satisfied
27. Keyword Search Summary
Pro
– Word Stemming: Hous* - house, housemate, household
– Easy to use/explain/agree
– Familiar
– Fast results
Con
– Over-inclusive: disambiguate
– Under-inclusive
– Word must be present
– Hard to craft
– Ineffective with short messages, IMs
28. Keyword Truths
Under-inclusive - missing relevant or important
info
Over-inclusive - costly to review
“Reasonable Keyword Search” doesn’t exist
Effective keyword search is difficult/impossible
– Index Data, Analyze Index
– Suggest keywords or approach
Keywords may not be appropriate for the data
Keyword Search is ONE Tool in Your Arsenal
30. Search Methodology Continuum
Review Methodology - Decided Upfront
Identify Issues in the Case
– Formulate Queries and Approaches for Finding
Responsive Documents
– Formulate Relevancy and Responsiveness Guidelines
Identify Primary Participants
Select or Triage Documents for Review
31. Review Tools for Relevancy Assessment
Keyword Searches, Culling
– Slices of Data are Reviewed
Categorization of Data
– Entire Dataset is Categorized
– Review Targeted Data
Automated Review
– Categorization of Dataset
– Random Sampling (Statistically Significant)
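How large a random sample needs to be can be sketched with the standard normal-approximation formula for estimating a proportion, with a finite population correction. The slide does not say which method or parameters were used, so the 95% confidence level (z = 1.96), the 5% margin of error, and the function name `sample_size` are assumptions for illustration.

```python
import math


def sample_size(population, confidence_z=1.96, margin=0.05, proportion=0.5):
    """Sample size for estimating a proportion in a finite population.

    proportion=0.5 is the most conservative choice (largest sample).
    """
    # Sample size for an effectively unlimited population.
    n0 = (confidence_z ** 2) * proportion * (1 - proportion) / (margin ** 2)
    # Finite population correction shrinks the sample for small collections.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


for docs in (100, 10_000, 1_000_000_000):
    print(f"{docs:>13,} documents -> sample of {sample_size(docs)}")
```

The required sample grows very slowly with collection size, which is what makes sampling practical for quality control on large reviews.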
32. Categorization of Data for Review
Categorize Entire Data Set
– Spam/Porn/System Files
– Personal/Private Data
– Non-relevant Business Data
Business Data
– Relevancy Assessment by Topic
– Privilege Review
Keyword, Topic Analysis - Overlap, Holes
35. Clustering
Clustering just means putting documents into groups that have
something in common.
Manually (that's what manual review is)
Keyword Searches
Ontologies (linguistic filters)
Automated clustering (using technology)
– Automated clustering by document type (all the Word documents go into one basket)
– Automated clustering by creation date
– Automated clustering by Actor
– Automated clustering by statistical similarity (statistical
clustering)
– ... and many other approaches
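The simplest automated clusterings listed here, by document type or by actor, amount to grouping on one metadata field. A minimal sketch follows. The field names (`id`, `type`, `actor`) and the sample records are hypothetical.

```python
from collections import defaultdict


def cluster_by(documents, field):
    """Group document ids by the value of one metadata field."""
    clusters = defaultdict(list)
    for doc in documents:
        clusters[doc[field]].append(doc["id"])
    return dict(clusters)


# Made-up metadata records.
docs = [
    {"id": 1, "type": "docx", "actor": "alice"},
    {"id": 2, "type": "xlsx", "actor": "bob"},
    {"id": 3, "type": "docx", "actor": "bob"},
]
by_type = cluster_by(docs, "type")
by_actor = cluster_by(docs, "actor")
print(by_type)
print(by_actor)
```

Statistical clustering differs in that the grouping key is computed from the document text rather than read from a field.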
36. Clustering -- “Options”
1 Cluster or 4 Clusters?
– Financial/energy trading options
– Email/computer menu-driven options
– Stock options (ISOs)
– The generic idea of an available choice of action
37. Clustering
Software implements statistical methods of finding groups of “similar” documents
– “Similar” must be defined appropriately for the application
Documents are categorized with very little effort by the user
May help with document review
– A single reviewer can look at similar documents together and produce consistent review decisions
– Tight clustering can be used to detect “near duplicates” caused by OCR errors
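One way to make "similar" concrete for near-duplicate detection is to compare overlapping character sequences (shingles) with Jaccard similarity. This is a common textbook technique offered as an assumption, not necessarily what any particular clustering product does. The function names, the shingle length, and the sample sentences are hypothetical.

```python
def char_shingles(text, k=4):
    """Return the set of k-character substrings of the normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of overlap over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "The board approved the stock option grant on Friday"
ocr_copy = "The board approved the stock 0ption grant on Friday"  # OCR read o as 0
unrelated = "What are our options for lunch today"

near = jaccard(char_shingles(original), char_shingles(ocr_copy))
far = jaccard(char_shingles(original), char_shingles(unrelated))
print(f"OCR copy similarity:  {near:.2f}")
print(f"unrelated similarity: {far:.2f}")
```

A tight threshold on this score groups the OCR-damaged copy with its original, while the unrelated sentence that also mentions options stays out.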
38. Clustering vs. queries
Clustering is unpredictable compared to keywords or
taxonomies
The items that look very similar (to the clustering
algorithm) may not actually be similar in ways that
matter
– Relevancy may depend upon fine legal distinctions
– May vary in the same matter by subpoena and/or
jurisdiction
39. Ontologies
Implement ontologies for directed searches.
– Approach searching from a knowledge-representation viewpoint
– Field is 25 years old, lots of work done
– Advantages:
• Disambiguate different meanings of the same word from their context: more accurate
• Encapsulate many ways of saying the same thing: more thorough
• Search for concepts, not individual words: more intuitive, more reusable, and faster
Can be combined with other methods (unsupervised
clustering, discussions).
41. A More Realistic Ontology
ROYALTY CONCEPT
• royalty
• royalties
• rty
• commission
• commissions
• comm.
• honorarium
• honorariums
• honoraria
• usage fee
• usage charge
• usg fee
• use fee
• fee for use
• fee for usage
• incent*
• insent*
• earn a fee
• eam a fee
• charge for use
• charged for use
• charging for use
• charges for use
• licence fee
• license fee
• lisense fee
• “take cut”~2
• “takes cut”~2
• “took cut”~2
• “slice pie”~5
• “piece pie”~5
• “piece action”~5
• “slice action”~5
• -king
• -queen
• -prince
• -princess
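The concept list on this slide mixes plain terms, wildcard prefixes (`incent*`), proximity phrases (`~2`), and exclusions (`-king`). The notation resembles common full-text query syntax, but the slide does not state the exact semantics of the tool behind it. The sketch below is a simplified stand-in that handles phrases, prefixes, and exclusions only, not proximity. The dictionary layout and function names are hypothetical.

```python
import re

# A small subset of the ROYALTY concept, in a made-up structure.
ROYALTY = {
    "phrases": ["royalty", "royalties", "license fee", "licence fee", "usage fee"],
    "prefixes": ["incent", "insent"],
    "exclude": ["king", "queen", "prince", "princess"],
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def matches_concept(text, concept):
    """True if the text hits the concept and contains no excluded word."""
    words = tokenize(text)
    if any(word in concept["exclude"] for word in words):
        return False
    # Pad with spaces so phrases only match on whole-word boundaries.
    padded = " " + " ".join(words) + " "
    phrase_hit = any(f" {phrase} " in padded for phrase in concept["phrases"])
    prefix_hit = any(word.startswith(prefix)
                     for word in words for prefix in concept["prefixes"])
    return phrase_hit or prefix_hit


print(matches_concept("We owe a license fee of 5%", ROYALTY))
print(matches_concept("The king met British royalty", ROYALTY))
print(matches_concept("Q3 incentives plan attached", ROYALTY))
```

The exclusion list is what keeps documents about monarchs out of a search for royalty payments.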
42. Ontology as a Query
But it can be slightly cumbersome to deal with directly in
that form
q ((+(std:%CapacityReports_% std:%DINCapacity_%) +(std:%ACMEEPPlant_% std:%ProductName_%)) (+(std:%ACMEPNPlant_%
std:%ProductName_%) +(std:%ProductiveCapability_% std:%CapacityReports_%)) (+(std:%CapacityCreep_%
std:%OperationsImprovement_% std:%CapacityExpansion_% std:%CapacityRestoration_%) +(std:%ACMEPNPlant_%
std:%ProductName_%)) (+(std:%EquipmentReplacement_% std:%FinishingColumn_%) +(std:%ACMEPNPlant_%
std:%ProductName_%)) (std:%Audit_% actor:%Audit_%) (+(std:%SettlementNegotiations_% std:%ContractNegotiations_% )
+(actor:%ACMEOutsideCounsel_% std:%ACMEOutsideCounsel_% actor:%ACME UBOutsideCounsel_%
std:%AcmeSubOutsideCounsel_% actor:%AcmeSub_% std:%AcmeSub_%)) (std:%FTC_% actor:%FTC_%)
((+subject:%ProductName_% +(std:swap std:"supply agreement" std:"exchange agreement" std:"agree to exchange")) std:"name
(About a quarter of its regular size)
43. Ontology Pros & Cons
Identify acronyms
Normalize variants
Disambiguate terms
Identify overly broad keywords
Identify and correct keywords with errors
Create extensive libraries of ontologies
Can be used as a clustering method
Topics can appear in more than one language
Reusable for different types of litigation, e.g. anti-trust,
product liability etc. (and for both offense and defense)
As with Keyword - word based
Labor intensive, upfront
44. “Search” Terminology
Technology-Enhanced Review
Technology Assisted Review
Automated Review
Predictive Coding
• Process
• Workflow
Technology
People
• Subject Matter
• Review
• Feedback
• Privilege
• Production
Quality
Control
45. Setup
Sample
Expert judges sample: Responsive / Non-responsive
Model learns
Model predicts: Responsive / Non-responsive
Model categorizes all remaining documents
Repeat as needed
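The loop on this slide (sample, expert judgment, model learns, model predicts) can be sketched with a very small text classifier. The slide does not name the model, so the naive Bayes classifier below is a stand-in chosen for brevity. The function names and the judged sample are hypothetical.

```python
import math
from collections import Counter


def train(judged_sample):
    """Learn word counts from (text, is_responsive) pairs judged by an expert.

    Assumes the sample contains at least one document of each class.
    """
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in judged_sample:
        word_counts[label].update(text.lower().split())
        doc_counts[label] += 1
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocabulary


def predict(model, text):
    """Return True if the model scores the text as responsive."""
    word_counts, doc_counts, vocabulary = model
    total_docs = doc_counts[True] + doc_counts[False]
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are ignored
                # Add-one smoothing avoids log(0) for unseen word/class pairs.
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocabulary)))
        scores[label] = score
    return scores[True] > scores[False]


judged_sample = [
    ("stock option grant approved", True),
    ("backdated option grant memo", True),
    ("lunch menu for friday", False),
    ("office party on friday", False),
]
model = train(judged_sample)
print(predict(model, "option grant for the ceo"))
print(predict(model, "friday lunch party"))
```

In practice the "repeat as needed" step means judging another sample, often drawn from documents the model is least sure about, and retraining.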
47. Technology Enhanced Review: Speed, Predictable Costs, and Accuracy
Example from a real case
Automate any portion of the review
[Figure: review funnel. Stages shown: Source Data (100%); Eliminate Duplicates & System Files; Non-Responsive Isolation (ontologies); Responsive by Technology Enhanced Review (removed another 7%); NR by Technology Enhanced Review (removed another 18%); Priv by High-Speed Manual Review. Other percentages shown in the figure: 30%, 30%, 15%, 22%, 3%.]
53. Better Answers and Better Questions
When were customary work practices circumvented?
When did established norms of behavior change?
Who knew, or likely knew, what facts?
Who interacted with whom and how intimately?
Who was involved in what types of decisions or meetings?
Who are the real ‘insiders’?
What data is hidden or missing?
When were electronically documented conversations
“taken off line,” possibly in an attempt to avoid detection?
How did the importance of different actors change over time?
54. Bear Stearns
Lower Bar For Fraud?
Two hedge fund managers arrested
Charged with securities and wire fraud, and one with insider trading
Internal emails:
– “I'm fearful of these markets. ... As we discussed it may not be a
meltdown for the general economy but in our world it will be.”
– “I think we should close the funds now.”
External communications:
– “We are very comfortable with exactly where we are.”
– “The funds are performing exactly as they were designed to.”
56. Analysis of Anomalous Communication Patterns
Unusual levels relative to a particular type of activity pop out
Color-coded graphs show relative communication densities for apples-to-apples comparisons
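A simple way to make unusual levels "pop out" numerically is to score each period against the average for the same type of activity. The z-score approach below is an assumption for illustration, since the slide does not describe how the tool measures anomalies. The function name, the threshold, and the weekly counts are hypothetical.

```python
from statistics import mean, pstdev


def anomalous_periods(counts, threshold=2.0):
    """Return indexes of periods whose count is far from the average.

    A period is flagged when it lies at least `threshold` standard
    deviations away from the mean of all periods.
    """
    average = mean(counts)
    spread = pstdev(counts)
    if spread == 0:
        return []  # every period is identical, nothing stands out
    return [i for i, count in enumerate(counts)
            if abs(count - average) / spread >= threshold]


# Made-up weekly email counts between two people.
weekly_emails = [12, 10, 11, 13, 12, 48, 11, 12]
flagged = anomalous_periods(weekly_emails)
print(f"weeks flagged for review: {flagged}")
```

Comparing each pair of people against their own baseline is what keeps the comparison apples to apples.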
59. “Call Me” Events
Sequence Viewer used for analytics-driven review
60. Search Risks
Failure to find responsive documents
Failure to recognize responsive documents
Failure to recognize privileged documents
Inconsistent treatment of documents (e.g.,
duplicates)
Failure to complete project in a timely manner
Sophisticated Tools
– Understand What They Do and Don’t Do Well
– Inform Yourself, Speak to References, Consultants
61. Transparency of Process
Discussing Review Protocols
– Provide transparent, defensible, sophisticated search
based on document content
– Clustering, Ontologies, Analytics, and yes, sometimes
Keywords too
Develop search methodologies for each case
– Use technology experts in consultation with case / legal
experts
Results verifiable by Quality Control
– Defensible sampling
62. Thank you!
Sonya L. Sigler
Vice President, Product Strategy
SFL Data
415-321-8385
sonya@sfldata.com
www.sfldata.com
63. Review Protocol
≠ Agreeing to Search Terms
Data Culling (upfront or backend)
Search Methodologies - Continuum
– Keyword Positive List
– Ontologies
– Clustering
– Technology Enhanced Review
– Relationship Analysis
Quality Control Process & Procedures
Privilege Review, Sensitivities
Production Format & Timing
64. Search
The Courts are Finally Starting to Catch up to
Technology
Making more aggressive rulings:
– Forcing attorneys to live with the results of bad
searches
– Sanctioning those who screw up, even if no allegation
of fraud
– Demanding repeatable,
demonstrable process – using
terms like “quality assurance”
65. Search Under Scrutiny
Facciola’s Opinions - United States v. O’Keefe
“for lawyers and judges to dare opine that a certain
search term or terms would be more likely to produce
information than [other] search terms … is truly to go
where angels fear to tread.”
He has also suggested that litigants take a good look at
more advanced search methodologies, including the use
of computational linguistics and technology assisted
review
66. Reasonableness of Search Methods
Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md., May 29, 2008).
"Common sense suggests that even a properly designed and executed
keyword search may prove to be over-inclusive or under-inclusive...the only
prudent way to test the reliability of the keyword search is to perform some
appropriate sampling."
“Selection of the appropriate search and information retrieval technique
requires careful advance planning by persons qualified to design effective
search methodology. The implementation of the methodology selected should
be tested for quality assurance; and the party selecting the methodology must
be prepared to explain the rationale for the method chosen to the court,
demonstrate that it is appropriate for the task, and show that it was properly
implemented.”
67. From Pre-Discovery to Production Completeness
Henry v. Quicken Loans --> 26(f) consulting
– Lawyers agreed to keyword lists and process
– Ran own (unsanctioned) searches with expert
– Told to live with bad results, and pay for it
Qualcomm --> Smell Test; Dig Deeper
– In-house counsel (Qualcomm) v. Outside Counsel (Day Casebeer)
– Sanctions, Attorney-Client Privilege Problems
– Associate found docs and was told they weren’t relevant; found out the hard way that those and 230,000 other pages were relevant
Judge Rader’s Protocol in TX for Patent cases
– 5 custodians
– 5 search terms (can you say overbroad…)
68. Under-inclusive - Missing in Action
Missing abbreviations / acronyms / clippings:
– incentive stock option but not ISO
– Board of Directors but not BOD
– 1998 plan but not 98 plan
Missing inflectional variants:
– grant but not grants, granted, granting
Missing spellings or common misspellings:
– gray but not grey
– privileged but not priviliged, priviledged, privilidged,
priveliged, privelidged, priveledged, …
69. Missing in Action II
Missing syntactic variants:
board of directors meeting
but not
meeting of the board of directors
BOD meeting
board meeting
BOD mtg
board mtg
directors’ meeting
directors’ mtg
mtg of the BOD
mtg of the directors
BOD meetings
board meetings
BOD mtgs
board mtgs
directors’ meetings
directors’ mtgs
mtgs of the BOD
mtgs of the directors
70. Missing in Action III
Missing synonyms / paraphrases:
hire date but not start date
approved by Smith
but not
Smith’s approval
the approval of Smith
Smith’s ok
Smith’s go-ahead
Smith’s goahead
the go-ahead from Smith
the goahead from Smith
the nod from Smith
Smith’s signature
Smith’s sign-off
the sign-off of Smith
the signoff of Smith
71. Missing in Action IV
As a keyword item, the address
101 E. Bergen Ave., Temple, CA 90200
does not match any of:
101 East Bergen Avenue
the Bergen site
the Temple location
our 90200 outlet
72. Over-inclusive - Unwanted Extras
Options
Target: Sheila was granted 100,000 options at $10
Match: What are our options for lunch?
Match in a signature line:
Amanda Wacz
Acme Stock Options Administrator
Destroy
Target: destroy evidence
Match in a disclaimer: The information in this email, and any
attachments, may contain confidential and/or privileged
information and is intended solely for the use of the named
recipient(s). Any disclosure or dissemination in whatever form, by
anyone other than the recipient is strictly prohibited. If you have
received this transmission in error, please contact the sender
and destroy this message and any attachments. Thank you.
73. Unwanted Extras II
alter*
Target: alter, alters, altered, altering
Matches: alternate, alternative, alternation, altercate,
altercation, alterably, …
grant
Target: stock option grant
Matches names: Grant Woods, Howard Grant
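The `alter*` problem is easy to reproduce with regular expressions. The sketch below contrasts a wildcard pattern with one that lists only the intended inflections. The sample sentence is made up, and the patterns are one possible way to express the two searches, not the syntax of any particular review tool.

```python
import re

# Wildcard search: "alter" followed by any further word characters.
WILDCARD = re.compile(r"\balter\w*", re.IGNORECASE)
# Targeted search: only alter, alters, altered, altering.
INFLECTED = re.compile(r"\balter(?:s|ed|ing)?\b", re.IGNORECASE)

text = "He altered the log after an altercation about the alternative plan."

wildcard_hits = WILDCARD.findall(text)
inflected_hits = INFLECTED.findall(text)
print("wildcard: ", wildcard_hits)
print("inflected:", inflected_hits)
```

Listing the inflections explicitly removes the unwanted extras here, but it would not help with the `grant` example, where the unwanted matches are personal names spelled exactly like the target word.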
74. Tuning an Ontology
Linguists briefed as reviewers
Linguists read the data
Linguists study complaint and other relevant
documents
Linguists analyze the search index
Legal Team provides input, feedback
75. A Simple Linguistic Ontology
ROYALTY CONCEPT
– Royalty
– Commission
– Honorarium
– Usage Fee
– Slice of the Pie
76. A Simple Pricing Concept
PRICING CONCEPT
– Purchase Order
– PO
– Dollar amount
– Invoice
77. Adding Subjective Content
PRICING CONCEPT
– Purchase Order
– PO
– Dollar amount
– Invoice
– Cylinder
– Canister
– Bottle
78. Ontology Usage
Identifying Misspellings, Slang, Nicknames, etc.
Variant Generation – help the user find what he
meant (names, words, suggestions)
– Buy* Buying, Buys, Bought, etc.
– Kenneth Lay, Ken Lay, klay, kenneth.lay
View variations in context to choose topics
Document segmentation – text blocks, signatures
Finding Words in Context, Frequency
– at serious risk of losing: 25
– are certain risks inherent in: 16
79. Identifying misspellings, slang, etc
1. Match the index against an electronic dictionary.
2. From the remaining material (not in the dictionary), remove any items that are merely numbers.
3. Find (in the ontologies) any words that are similar to what remains.
4. Add the similar words to the ontology.
This increases coverage (i.e., ensures
that we retrieve documents that
otherwise would have been missed)
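The four steps on this slide can be sketched with Python's standard `difflib` module, whose `get_close_matches` function ranks candidate strings by similarity. Using `difflib` and a 0.8 cutoff is an assumption made for illustration, since the slide does not say how similarity is measured. The function name and the sample data are hypothetical.

```python
import difflib


def find_ontology_variants(index_terms, dictionary, ontology_terms, cutoff=0.8):
    """Map each ontology term to unknown index terms that resemble it."""
    # Step 1: keep only index terms that the dictionary does not know.
    unknown = [term for term in index_terms if term not in dictionary]
    # Step 2: remove items that are merely numbers.
    unknown = [term for term in unknown
               if not term.replace(",", "").replace(".", "").isdigit()]
    # Steps 3 and 4: find the closest ontology term for each leftover item
    # and record the leftover as a variant to add to the ontology.
    additions = {}
    for term in unknown:
        closest = difflib.get_close_matches(term, ontology_terms, n=1, cutoff=cutoff)
        if closest:
            additions.setdefault(closest[0], []).append(term)
    return additions


# Made-up search index, dictionary, and ontology.
index_terms = ["license", "lisense", "royalty", "roylaty", "2014", "xyzzy"]
dictionary = {"license", "royalty"}
ontology_terms = ["license", "royalty"]

additions = find_ontology_variants(index_terms, dictionary, ontology_terms)
print(additions)
```

A linguist would normally review the suggested additions before they go into the ontology, since string similarity alone can pair unrelated words.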
80. Variant Generation
Help the user search for what he meant
Take names, numbers, and other entities for which the user wants to search
Automatically generate likely synonyms
81. Variant Generation
Show the context of these variations, so the user can
evaluate them.
82. Document Segmentation
Examples of signatures
Jean-Louis Koenig
President GGDA Region
MegaCorp International SA
Rue de Concours 2280
Bern, Switzerland
Robert Guilliam
Product Regulatory Affairs & Compliance
MegaCorp International
Neuchatel
Switzerland
Tél. +41 (31) 125 2366
Alberto Goreman
Manager Printing & Packaging, Eastern Region
+57 3 451 7195, alberto_goreman@megacorp.com
83. Finding words in context
Phrase                                Total Instances
risks alienating some                 37
at serious risk of losing             25
are certain risks inherent in         16
are at risk of running                15
it be risking anything by             15
difference a risk o why               14
and the risks inherent in             12
without assuming any risk             8
we could risk losing next             7
avoid transferring risk to the        5
requires taking risks and the         4
can t risk not living                 3
and unknown risks and uncertainties   2
a potential risk that was             2
avoid transfering risk to the         2
This increases coverage AND precision
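A phrase table like the one on this slide can be produced by counting the words that surround every occurrence of a term. The sketch below is a minimal version. The window of two words on each side, the function name, and the sample sentences are assumptions for illustration.

```python
import re
from collections import Counter


def phrases_in_context(texts, stem, before=2, after=2):
    """Count the word windows around every word that starts with `stem`."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        for i, word in enumerate(words):
            if word.startswith(stem):
                start = max(0, i - before)
                counts[" ".join(words[start:i + after + 1])] += 1
    return counts


# Made-up sentences.
texts = [
    "We are at serious risk of losing the account.",
    "The plant is at serious risk of losing its permit.",
    "There are certain risks inherent in this plan.",
]
table = phrases_in_context(texts, "risk")
for phrase, total in table.most_common():
    print(f"{phrase:<32}{total}")
```

Reading the term in context lets the reviewer keep the phrases that signal the issue and drop the ones that do not, which is how the same step can raise both coverage and precision.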
84. Multi-Lingual Issues
Does language matter?
– Lucerne
– Luzerne
– Lucerna
These names all refer to the same city
Name of city not necessarily expressed in the same
language as rest of document
In Europe, many email threads and documents are
mixed language, and must be properly categorized as
such
85. Automated Ontology Expansion Tools
Currently implemented expansion modules:
Spelling variants:
color >> colour, defense >> defence, labeled >> labelled
Lemmatization (recovering uninflected form):
walking >> walk, ate >> eat
Morphological variants:
eat >> eats, eating, eaten, ate
hablar >> hablo, hablas, habla, hablan, habláis, hablamos
Number expansion:
$2.5B >> two point five billion dollars
2,567 >> two thousand five hundred sixty seven
13 >> 13th, thirteenth
Name variants:
Elizabeth Van der Beek >> “Liz Van der Beek”, “Liz Vander Beek”, “Van
der Beek, Elizabeth”, “Beth Vanderbeek”, etc.
Email variants (mined from alias clusters file):
Elizabeth Van der Beek >> evanderbeek, liz.vanderbeek, vanderbeekl,
emvanderbeek, etc.
Abbreviations:
administrative project meeting >> admin project meeting, admin project
mtg, admin proj mtg, etc.
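The name-variant module can be sketched as a small generator that combines nicknames, a joined form of a multi-part surname, and the "Last, First" ordering. This is an illustration only and not the implementation of the modules named on the slide. The nickname table and the function name are hypothetical.

```python
# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"elizabeth": ["Liz", "Beth"]}


def name_variants(first, last):
    """Generate likely written forms of a person's name."""
    first_names = [first] + NICKNAMES.get(first.lower(), [])
    # Join a multi-part surname into one word, e.g. "Van der Beek" -> "Vanderbeek".
    joined = "".join(last.split()).capitalize()
    last_names = [last] if joined == last else [last, joined]
    variants = set()
    for f in first_names:
        for l in last_names:
            variants.add(f"{f} {l}")   # First Last
            variants.add(f"{l}, {f}")  # Last, First
    return variants


variants = name_variants("Elizabeth", "Van der Beek")
for variant in sorted(variants):
    print(variant)
```

Email aliases are a different case: the slide notes they are mined from an alias clusters file rather than generated, so they are not covered by this sketch.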