“Triggers,” Preservation
           & Search

           June 2, 2012
           Georgetown Law

           Sonya L. Sigler




6/4/2012                              1
Overview
      Triggers & Preservation
      • What is it?
      • Why Does it Matter?
      Search
        Keyword Search
        Clustering
        Ontologies
        Technology Enhanced Review - Sampling
        Social Networking Analysis
        Relationship Analysis

“Triggers” & Preservation
       What is a Trigger?
           – Litigation reasonably anticipated
           – Who decides
       Litigation Hold Continuum
           –   Established in hindsight
           –   Threat
           –   Letter about litigation
           –   Filing Suit
       Cases
           – Pippins, Zubulake, Pension Committee


Pippins v. KPMG
       How much data to Preserve?
           – All hard drives (Pippins' position)
           – 100 Sample Hard drives (KPMG's position)
       To Cooperate or NOT to Cooperate?
       How Judges React to Lack of Cooperation




Zubulake
       Litigation Holds
           – Cannot send a request into the ether
       Preservation
           Have to follow-up
             Take affirmative steps to monitor compliance
           In-house Counsel Duty
             Cannot leave it to employees' discretion
           Document what was done




Pension Committee
       No intentional destruction of data
       Careless & indifferent
       No Latchkey Custodians (alone & unsupervised)
           – Identify Custodians
           – Monitor their efforts
           – Including former employees and third parties
       Proactive
       Consistent
       Reasonable Approach



Triggers




           When does a duty to preserve arise?




What To Do?
       Who to include?
           – Not about data volume
           – Not about contact with underlying “litigation”
       Key Players (Zubulake opinions)
           – Likely to have relevant information
           – CEO, Board, Committees, employees, etc.
       Produce it from the Key Player (not others)
           – Nursing Home Pension Fund v. Oracle
           – Produce emails from the CEO (15) not others (1,650)



Spoliation
       Failure to Preserve
           – Didn't Ask
              • Right person
              • Right Place
           – Didn't follow up
       Destruction of Data
           – Intentional
           – Inadvertent destruction
       What can happen
           – Sanctions
           – Adverse Inferences

Search
       How to Use it To Find Information
       How to Use it to Ignore Information
       When to use which search methodology




Search - Data Assessment
       Where is the Data?
           – Data Mapping -
             databases, servers, desktops, laptops, IMs, smart
             phones, voicemail, other records
       Defining Process from Collection to Review to
       Production
       Collection Strategy, Process, Approach
           – Scope of collection: custodians, date ranges, topics
       Reports on the Data Processing
           – File types, encrypted files, de-duplication
             rates, password protected files, etc.
       Not Reasonably Accessible data
       Assessing Risk of Data Loss
Search - Case Assessment
       Who - Cast of Characters
       What - What the Heck Happened?
       Where - Where did it take place?
       When - What time period are we concerned with?
       How - fraud, antitrust violation, etc.
       WHY - What were the motives involved?

           Data Assessment ≠ Effective Case Assessment




Keyword Search Under Scrutiny
       United States v. O'Keefe (Facciola)
           – Questioned lawyers' ability to decide which search terms are more likely to
             produce relevant information
           – Facciola has also suggested that litigants take a look at advanced search
             methodologies


       Victor Stanley, Inc. v. Creative Pipe, Inc. (Grimm)
           – Defensibility of process AND execution lies with the party relying upon the
             search protocol to meet its obligations; that party must be able to explain
             the search rationale, appropriateness, and proper implementation
           – Advocates quality assurance, e.g. by sampling
           – Searches should be designed by a competent practitioner




Keyword Specific Case
     William A. Gross Construction Associates, Inc. v.
     American Manufacturers Mutual Insurance Company
     SDNY, Judge Andrew Peck
     Keyword list was in the thousands
     Use the actual data set and custodians to figure out
     keywords

     “This case is just the latest example of lawyers designing keyword
     searches in the dark, by the seat of the pants, without adequate
     (indeed, here, apparently without any) discussion with those who wrote
     the emails. Prior decisions from Magistrate Judges in the Baltimore-
     Washington Beltway have warned counsel of this problem, but the
     message has not gotten through to the Bar in this District.”
$6M Keyword Mistake

       In re Fannie Mae Securities Litigation
       3rd Party - OFHEO
       DC Circuit - Judge David Tatel
       Attorney agreed to something he did NOT understand
       Long list of key terms
       Taxpayers suffered the consequence




What This Means



           • The Courts are finally
             catching up
           • Courts actively ruling on
             Standards of Care and
             Process
           • Lawyers are Getting Wise




Case Law Effects on Discovery


           Defensibility of Review Process is now a focus
           –   Culling now can kill you later
           –   Cooperation is a hot topic
           –   Tussle between inside & outside counsel
           –   Beginning to see planning as a necessity

           Increased focus on Quality
           – Heightened involvement expected from corporate clients
             in the overall process
           – Cases pushing this, Qualcomm, Creative Pipe

What Else Is There?
       Effort to establish & codify uniform “Best Practices”
           – Quickly becoming roadmap for uneducated industry
           – Increasingly relied upon by judges as measure of reasonable or
             standard behavior
       Publications have addressed:
           –   Document retention & production
           –   Email management
           –   Search & Retrieval
           –   Protective orders & confidentiality
           –   ESI admissibility




Getting to a Manageable Review Set

           Focus on finding, reviewing & using the “right” data,
           not just filtering data

           [Funnel chart; labels as shown on the slide]
           Intake Data            100%
           Duplicates             25%
           Junk/Spam/Porn         20%
           NR/Priv                20%
           Non-Responsive         20%
           Responsive & Priv      15%
           Produced               12.25%

           These figures vary based upon the data set received

Search Methodologies

           [Layered diagram; tiers listed from top to bottom]
           Context
               – Relationship Analysis: documents with causal or sequential relationship
               – Social Network Analysis: relationships among relevant people
           Concept
               – Clustering: similarity of salient features
               – Ontology: generalized words or phrases
           Content
               – Keyword: specific exact words, proximity searches, stemming
           Also shown on the diagram: Visualization, Measurement

Keyword Accuracy Example
    Keyword search reduced the
    document set by only 47%

    And 88% of the documents
    returned by keyword
    search were not responsive
    (Over-inclusive)




     8,553 responsive documents
     missed by keyword search
     (Almost 8% of responsive
     documents missed by
     keyword search - Under-inclusive)
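
The over- and under-inclusiveness on this slide correspond to the standard precision and recall measures. A minimal sketch in Python, assuming hypothetical counts chosen only to echo the slide's percentages (the 8,553 missed documents is the only count taken from the slide; `returned` and `returned_responsive` are invented for illustration):

```python
def precision_recall(returned, returned_responsive, missed_responsive):
    """Return (precision, recall) for a keyword search.

    precision: share of returned documents that are responsive
               (low precision = over-inclusive search)
    recall:    share of all responsive documents that were returned
               (low recall = under-inclusive search)
    """
    precision = returned_responsive / returned
    recall = returned_responsive / (returned_responsive + missed_responsive)
    return precision, recall


# Hypothetical counts, NOT from the case; only `missed` appears on the slide.
returned = 820_000             # documents hit by the keyword list
returned_responsive = 98_400   # 12% of hits are responsive, so 88% are not
missed = 8_553                 # responsive documents the keywords never hit

precision, recall = precision_recall(returned, returned_responsive, missed)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
```

Reporting both numbers matters: a keyword list can look acceptable on recall while still forcing reviewers through a set that is mostly non-responsive.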



Myth
           Keyword Searching is the Way to Go

             If I agree to keyword terms, I am OK

            Keyword Search Cases
            Keyword replacement example
            Keyword substitution


             Missing in Action (Under-inclusive)
             Unwanted Extras (Over-inclusive)
             Multiple subject/persons (Disambiguate)


Fact or Myth?

           Manual review by humans of large amounts of information
            is as accurate and complete as possible - perhaps even
             perfect - and constitutes the gold standard by which all
                          searches should be measured


      This is “The reigning Myth of ‘perfect’ retrieval using traditional
      means”
                   Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery
                                                                    The Sedona Conference Journal (2007) p. 199


      Human beings retrieved less than 20% of the relevant documents
      when they believed they were retrieving over 75%
                                   An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System
                                                                                                 Blair & Maron (1985)




Blair and Maron 1985
A classic study of retrieval effectiveness
– earlier studies were on unrealistically small collections
Studied an archive of documents for a legal suit
–   ~350,000 pages of text
–   40 queries
–   focus on high recall
–   Used IBM's STAIRS full-text system
Main Result:
– The system retrieved less than 20% of the relevant
  documents for a particular information need; lawyers
  thought they had 75%
But many queries had very high precision
Blair and Maron, cont.
How they estimated recall
– generated partially random samples of unseen documents
– had users (unaware these were random) judge them for
  relevance
Other results:
– the two lawyers' searches had similar performance
– lawyers' recall was not much different from paralegals'
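
The sampling idea under "How they estimated recall" can be sketched as follows. This is an illustration of the general technique (sample the documents the search did not retrieve, have reviewers judge the sample, extrapolate), not a reconstruction of Blair and Maron's exact procedure; the function name, the `is_relevant` callback standing in for human judgment, and the toy numbers are all assumptions.

```python
import random


def estimate_recall(retrieved_relevant, unretrieved_ids, is_relevant,
                    sample_size, seed=0):
    """Estimate recall by sampling the documents the search did NOT retrieve.

    retrieved_relevant: count of relevant documents the search found
    unretrieved_ids:    ids of every document the search did not return
    is_relevant:        callable standing in for a reviewer's judgment
    """
    rng = random.Random(seed)
    sample = rng.sample(unretrieved_ids, sample_size)
    hits = sum(1 for doc_id in sample if is_relevant(doc_id))
    # Scale the sample's hit rate up to the whole unretrieved population.
    estimated_missed = hits / sample_size * len(unretrieved_ids)
    return retrieved_relevant / (retrieved_relevant + estimated_missed)


# Toy data: 10,000 unretrieved documents, 1 in 25 secretly relevant.
unretrieved = list(range(10_000))
judge = lambda doc_id: doc_id % 25 == 0

estimate = estimate_recall(retrieved_relevant=100,
                           unretrieved_ids=unretrieved,
                           is_relevant=judge,
                           sample_size=2_000)
print(f"estimated recall: {estimate:.0%}")
```

In this toy setup the searchers found 100 relevant documents and left 400 behind, so the true recall is 20%; the sample-based estimate should land near that figure, with the gap shrinking as the sample grows.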
Blair and Maron, cont.
Why recall was low
– users can't foresee exact words and phrases that will
  indicate relevant documents
   • “accident” referred to by those responsible as:
   “event,” “incident,” “situation,” “problem,” …
   • differing technical terminology
   • slang, misspellings
– Perhaps the value of higher recall decreases as the
  number of relevant documents grows, so more detailed
  queries were not attempted once the users were satisfied
Keyword Search Summary
       Pro
           – Word Stemming (Hous* - house, housemate, household)
           – Easy to use/explain/agree
           – Familiar
           – Fast results
       Con
           – Over-inclusive (Disambiguate)
           – Under-inclusive
           – Word must be present
           – Hard to craft
           – Ineffective with short messages, IMs
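
The stemming example (Hous*) shows both the appeal and the cost of wildcard keywords. A small sketch using Python's standard `re` module, assuming a trailing `*` means "any word beginning with this prefix"; real review platforms may define wildcards differently, and the sample sentence is invented:

```python
import re


def wildcard_to_regex(term):
    """Translate a keyword with an optional trailing * into a whole-word regex."""
    if term.endswith("*"):
        body = re.escape(term[:-1]) + r"\w*"   # prefix plus any word characters
    else:
        body = re.escape(term)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


pattern = wildcard_to_regex("hous*")
text = "The household moved house; the Houston office and the warehouse were not involved."
matches = pattern.findall(text)
print(matches)
```

The pattern picks up "household" and "house" as intended, but also "Houston" (over-inclusive), while "warehouse" is skipped because the prefix must start the word.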




Keyword Truths
           Under-inclusive - missing relevant or important
           info
           Over-inclusive - costly to review
           “Reasonable Keyword Search” doesn't exist
           Effective keyword search is difficult/impossible
           – Index Data, Analyze Index
           – Suggest keywords or approach
           Keywords may not be appropriate for the data

            Keyword Search is ONE Tool in Your Arsenal

Search Methodology Continuum
       Review Methodology - Decided Upfront
       Identify Issues in the Case
           – Formulate Queries and Approaches for Finding
             Responsive Documents
           – Formulate Relevancy and Responsiveness Guidelines
       Identify Primary Participants
       Select or Triage Documents for Review




Review Tools for Relevancy Assessment

       Keyword Searches, Culling
           – Slices of Data are Reviewed
       Categorization of Data
           – Entire Dataset is Categorized
           – Review Targeted Data
       Automated Review
           – Categorization of Dataset
           – Random Sampling (Statistically Significant)
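
How large a random sample needs to be depends on the confidence level and margin of error the parties accept. A sketch of the textbook sample-size formula for estimating a proportion, with a finite-population correction; the 95% / ±2% settings and the 100,000-document population are illustrative assumptions, not figures from the presentation:

```python
import math
from statistics import NormalDist


def sample_size(population, margin, confidence=0.95, p=0.5):
    """Documents to sample to estimate a proportion within +/- margin.

    p=0.5 is the most conservative assumption about the true proportion.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided z-score
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)      # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)             # finite-population correction
    return math.ceil(n)


# Hypothetical review set of 100,000 documents, 95% confidence, +/- 2% margin.
print(sample_size(population=100_000, margin=0.02))
```

The required sample grows slowly with the size of the data set, which is why sampling stays practical even for very large collections.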




Categorization of Data for Review
       Categorize Entire Data Set
           – Spam/Porn/System Files
           – Personal/Private Data
           – Non-relevant Business Data
       Business Data
           – Relevancy Assessment by Topic
           – Privilege Review
       Keyword, Topic Analysis - Overlap, Holes



Categorization Methods
       Statistical Methods (#s based)
           – Topic Clustering
              • Statistical Similarity
              • Counting #s of words, appearance together
           – Latent Semantic Indexing
           – Supervised v. Unsupervised Clustering
       Linguistic Methods (Word Based)
           – Keyword (Culling Method)
           – Ontologies




Clustering
  Clustering just means putting documents into groups that have
  something in common.
   Manually (that's what manual review is)
   Keyword Searches
   Ontologies (linguistic filters)
   Automated clustering (using technology)
           – Automated clustering by document type (all the Word
             documents go into one basket)
           – Automated clustering by creation date
           – Automated clustering by Actor
           – Automated clustering by statistical similarity (statistical
             clustering)
           – ... and many other approaches



Clustering -- “Options”
  1 Cluster or 4 Clusters
    Financial/energy
    trading options
    Email/computer
    menu-driven
    options
    Stock options
    (ISO's)
    The generic idea of
    an available choice of
    action

Clustering
       Software implements statistical
       methods of finding groups of “similar”
       documents
           – “Similar” must be defined appropriately
             for the application
       Documents are categorized with very
       little effort by the user
       May help with document review
           – A single reviewer can look at similar
             documents together, produce
             consistent review decisions
           – Tight clustering can be used to detect
             “near duplicates” caused by OCR
             errors
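
One simple way to realize "tight clustering" for near-duplicate detection is to compare overlapping character sequences (shingles) with Jaccard similarity. This is a sketch of that one technique, not a description of any particular product; the 4-character shingle length, the 0.6 threshold, and the sample documents are assumptions.

```python
def char_shingles(text, k=4):
    """Set of overlapping k-character substrings of the normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + k] for i in range(max(len(normalized) - k + 1, 1))}


def jaccard(a, b):
    """Shared shingles divided by all shingles; 1.0 means identical sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicates(docs, threshold=0.6):
    """Return pairs of document names whose similarity meets the threshold."""
    names = list(docs)
    sets = {name: char_shingles(docs[name]) for name in names}
    pairs = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            if jaccard(sets[left], sets[right]) >= threshold:
                pairs.append((left, right))
    return pairs


# Hypothetical documents: scan_b is scan_a with two OCR-style character errors.
docs = {
    "scan_a": "Please review the quarterly capacity report",
    "scan_b": "Please review the quarterIy capacity rep0rt",
    "memo": "Lunch on Friday?",
}
print(near_duplicates(docs))
```

Character shingles tolerate a few wrong characters, so the two scans still pair up even though an exact-match or hash-based de-duplication would treat them as different documents.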

Clustering vs. queries
       Clustering is unpredictable compared to keywords or
       taxonomies
       The items that look very similar (to the clustering
       algorithm) may not actually be similar in ways that
       matter
           – Relevancy may depend upon fine legal distinctions
           – May vary in the same matter by subpoena and/or
             jurisdiction




Ontologies

           Implement ontologies for directed searches.
           –   Approach searching from a knowledge-representation viewpoint
           –   Field is 25 years old, lots of work done
           –   Advantages:
                • Disambiguate different meanings of the same word from their
                    context
                     More accurate
                • Encapsulate many ways of saying the same thing
                     More thorough
                • Search for concepts, not individual words
                     More intuitive, more reusable, and faster
           Can be combined with other methods (unsupervised
           clustering, discussions).


Subjectivity

           GOOD WEATHER
           – Sun
           – Calm

           BAD WEATHER
           – Rain
           – Snow
           – Wind



A More Realistic Ontology
           ROYALTY CONCEPT
           •   royalty          •   charge for use
           •   royalties        •   charged for use
           •   rty
                                •   charging for use
           •   commission
           •   commissions      •   charges for use
           •   comm.            •   licence fee
           •   honorarium       •   license fee
           •   honorariums      •   lisense fee
           •   honoraria        •   “take cut”~2
           •   usage fee        •   “takes cut”~2
           •   usage charge     •   “took cut”~2
           •   usg fee          •   “slice pie”~5
           •   use fee
                                •   “piece pie”~5
           •   fee for use
           •   fee for usage    •   “piece action”~5
           •   incent*          •   “slice action”~5
           •   insent*          •   -king
           •   earn a fee       •   -queen
           •   eam a fee        •   -prince
                                •   -princess
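
To make the notation on this slide concrete, here is a rough sketch of how a concept like this could be evaluated against a document. The semantics are assumptions: `~N` is read as "the two words occur within N word positions of each other", a trailing `*` as a prefix match, and a leading `-` as "reject the document if this word appears". The `ROYALTY_CONCEPT` structure holds only a subset of the slide's terms, and the function names are invented.

```python
import re

# Subset of the slide's concept, in a hypothetical structure.
ROYALTY_CONCEPT = {
    "terms": ["royalty", "royalties", "license fee", "usage fee"],
    "prefixes": ["incent"],                                   # incent*
    "proximity": [(("take", "cut"), 2), (("slice", "pie"), 5)],
    "exclude": ["king", "queen", "prince", "princess"],       # -king, -queen ...
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def within(tokens, first, second, max_gap):
    """True if `first` and `second` occur within max_gap positions of each other."""
    pos_a = [i for i, tok in enumerate(tokens) if tok == first]
    pos_b = [i for i, tok in enumerate(tokens) if tok == second]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


def matches_concept(text, concept):
    tokens = tokenize(text)
    # Exclusions win: "royalty" next to "king" is the wrong sense of the word.
    if any(word in tokens for word in concept["exclude"]):
        return False
    padded = " " + " ".join(tokens) + " "
    if any(" " + " ".join(tokenize(term)) + " " in padded
           for term in concept["terms"]):
        return True
    if any(tok.startswith(prefix)
           for tok in tokens for prefix in concept["prefixes"]):
        return True
    return any(within(tokens, a, b, gap)
               for (a, b), gap in concept["proximity"])


for sample in ["Please pay the license fee by Friday",
               "They take a cut of each sale",
               "The king granted a royalty",
               "Quarterly capacity report"]:
    print(sample, "->", matches_concept(sample, ROYALTY_CONCEPT))
```

The deliberate misspellings on the slide (lisense, insent*, eam) would simply be additional entries in `terms` and `prefixes`; they exist to catch typing and OCR errors in the underlying documents.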

Ontology as a Query

      But it can be slightly cumbersome to deal with directly in
      that form
       q ((+(std:%CapacityReports_% std:%DINCapacity_%) +(std:%ACMEEPPlant_% std:%ProductName_%)) (+(std:%ACMEPNPlant_%
       std:%ProductName_%) +(std:%ProductiveCapability_% std:%CapacityReports_%)) (+(std:%CapacityCreep_%
       std:%OperationsImprovement_% std:%CapacityExpansion_% std:%CapacityRestoration_%) +(std:%ACMEPNPlant_%
       std:%ProductName_%)) (+(std:%EquipmentReplacement_% std:%FinishingColumn_%) +(std:%ACMEPNPlant_%
       std:%ProductName_%)) (std:%Audit_% actor:%Audit_%) (+(std:%SettlementNegotiations_% std:%ContractNegotiations_% )
       +(actor:%ACMEOutsideCounsel_% std:%ACMEOutsideCounsel_% actor:%ACME UBOutsideCounsel_%
       std:%AcmeSubOutsideCounsel_% actor:%AcmeSub_% std:%AcmeSub_%)) (std:%FTC_% actor:%FTC_%)
       ((+subject:%ProductName_% +(std:swap std:"supply agreement" std:"exchange agreement" std:"agree to exchange")) std:"name

                                             (About a quarter of its regular size)




Ontology Pros & Cons
       Identify acronyms
       Normalize variants
       Disambiguate terms
       Identify overly broad keywords
       Identify and correct keywords with errors
       Create extensive libraries of ontologies
       Can be used as a clustering method
       Topics can appear in more than one language
       Reusable for different types of litigation, e.g. anti-trust,
       product liability etc. (and for both offense and defense)

       As with Keyword - word based
       Labor intensive, upfront
“Search” Terminology
       Technology-Enhanced Review
       Technology Assisted Review
       Automated Review
       Predictive Coding


           [Diagram: three interlocking elements]
           Technology
               • Process
               • Workflow
           People
               • Subject Matter
               • Review
               • Feedback
           Quality Control
               • Privilege
               • Production



Setup

           [Workflow diagram; steps in order]
           – Sample is drawn from the document set
           – Expert judges sample: Responsive / Non-responsive
           – Model learns
           – Model predicts: Responsive / Non-responsive
           – Repeat as needed
           – Model categorizes all remaining documents
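
The loop in this diagram (sample, expert judgment, model learns, model predicts, repeat) can be illustrated with a deliberately tiny text classifier. The sketch below uses a naive Bayes model with add-one smoothing as a stand-in; the presentation does not say which learning method a given review tool uses, and the documents and labels are invented.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(labeled):
    """labeled: list of (text, is_responsive) pairs judged by an expert."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labeled:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts


def predict(model, text):
    """Return True if the model scores the text as more likely responsive."""
    word_counts, doc_counts = model
    vocab = set(word_counts[True]) | set(word_counts[False])
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        # Smoothed log prior, then smoothed log likelihood of each word.
        score = math.log((doc_counts[label] + 1) / (sum(doc_counts.values()) + 2))
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab) + 1))
        scores[label] = score
    return scores[True] > scores[False]


# Round 1: the expert judges a small sample (hypothetical documents).
judged = [
    ("capacity expansion at the plant", True),
    ("supply agreement pricing terms", True),
    ("lunch menu for friday", False),
    ("fantasy football picks", False),
]
model = train(judged)

# The model categorizes the remaining documents.
remaining = ["plant capacity report", "friday lunch plans"]
for doc in remaining:
    print(doc, "->", "Responsive" if predict(model, doc) else "Non-responsive")

# "Repeat as needed": add newly judged documents and retrain.
judged.append(("exchange agreement draft", True))
model = train(judged)
```

Each pass through the loop gives the model more expert-judged examples, which is where the quality of the subject-matter reviewers feeds directly into the quality of the predictions.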
Automated Review Methodology
Technology Enhanced Review:
           Speed, Predictable Costs, and Accuracy
       Automate any portion of the review

           [Funnel chart: example from a real case]
           Source Data: 100%
           Eliminate Duplicates & System Files: 30%
           Non-Responsive Isolation (ontologies): 30%
           NR by Technology Enhanced Review: removed another 18% (22% remaining)
           Responsive by Technology Enhanced Review: removed another 7% (15% remaining)
           Priv by High-Speed Manual Review: 3%

From Document Analysis to
             Social Network Analysis




From Social Network Analysis
                 to Discussions




Analytics are Based on the Model
                  and on Discussions




                                     Analytics
Better Answers and Better Questions
       When were customary work practices circumvented?
       When did established norms of behavior change?
       Who knew, or likely knew, what facts?
       Who interacted with whom and how intimately?
       Who was involved in what types of decisions or meetings?
       Who are the real 'insiders'?
       What data is hidden or missing?
       When were electronically documented conversations
       “taken off line,” possibly in an attempt to avoid detection?
       How did the importance of different actors change over time?
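
Questions such as "who interacted with whom" start from simple counts over message metadata. A minimal sketch, assuming each message has been reduced to a sender and a list of recipients; the names and messages are hypothetical, and real social network analysis tools go much further (time windows, centrality measures, topic overlays):

```python
from collections import Counter


def interaction_counts(messages):
    """Count messages exchanged between each unordered pair of people."""
    counts = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient != sender:
                counts[frozenset((sender, recipient))] += 1
    return counts


# Hypothetical email metadata: (sender, [recipients]).
messages = [
    ("alice", ["bob", "carol"]),
    ("bob", ["alice"]),
    ("carol", ["alice"]),
    ("alice", ["bob"]),
]

counts = interaction_counts(messages)
for pair, n in counts.most_common():
    print(sorted(pair), n)
```

Running the same count over successive date ranges is one straightforward way to see how a person's place in the network shifts over time.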



Bear Stearns
                                Lower Bar For Fraud?

           Two hedge fund managers
           arrested
           Charged with securities and
           wire fraud, and one with
           insider trading
           Internal emails:
           – “I'm fearful of these markets. ... As we discussed it may not be a
             meltdown for the general economy but in our world it will be.”
           – “I think we should close the funds now.”
           External communications:
           – “We are very comfortable with exactly where we are.”
           – “The funds are performing exactly as they were designed to.”
Sentiment Analysis Visualization




Analysis of Anomalous Communication Patterns




           Unusual levels relative to a
           particular type of activity
           pop out

           Color-coded graphs show
           relative communication
           densities for apples to
           apples comparisons



Spread of Information




Emotive Tone
           Whistle-blower Scenario




“Call Me” Events
           Sequence Viewer used for analytics-driven review




Search Risks
       Failure to find responsive documents
       Failure to recognize responsive documents
       Failure to recognize privileged documents
       Inconsistent treatment of documents
       (e.g., duplicates)
       Failure to complete project in a timely manner

       Sophisticated Tools
           – Understand What They Do and Don't Do Well
           – Inform Yourself, Speak to References, Consultants

Transparency of Process
       Discussing Review Protocols
           – Provide transparent, defensible, sophisticated search
             based on document content
           – Clustering, Ontologies, Analytics, and yes, sometimes
             Keywords too
       Develop search methodologies for each case
           – Use technology experts in consultation with case / legal
             experts
       Results verifiable by Quality Control
           – Defensible sampling



Thank you!




                      Sonya L. Sigler
             Vice President, Product Strategy
                         SFL Data
                       415-321-8385
                   sonya@sfldata.com
                     www.sfldata.com




Review Protocol
       ≠ Agreeing to Search Terms
       Data Culling (upfront or backend)
       Search Methodologies - Continuum
           –   Keyword Positive List
           –   Ontologies
           –   Clustering
           –   Technology Enhanced Review
           –   Relationship Analysis
       Quality Control Process & Procedures
       Privilege Review, Sensitivities
       Production Format & Timing
Search
       The Courts are Finally Starting to Catch up to
       Technology
       Making more aggressive rulings:
           – Forcing attorneys to live with the results of bad
             searches
           – Sanctioning those who screw up, even if no allegation
             of fraud
           – Demanding repeatable,
             demonstrable process – using
             terms like “quality assurance”



6/4/2012                                                             64
Search Under Scrutiny
           Facciola’s Opinions - United States v. O’Keefe

        “for lawyers and judges to dare opine that a certain
       search term or terms would be more likely to produce
       information than [other] search terms … is truly to go
                    where angels fear to tread.”

            He has also suggested that litigants take a good look at
            more advanced search methodologies, including the use
            of computational linguistics and technology assisted
            review




6/4/2012                                                               65
Reasonableness of Search Methods
       Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md., May 29, 2008).




           "Common sense suggests that even a properly designed and executed
           keyword search may prove to be over-inclusive or under-inclusive...the only
           prudent way to test the reliability of the keyword search is to perform some
           appropriate sampling."

           “Selection of the appropriate search and information retrieval technique
           requires careful advance planning by persons qualified to design effective
           search methodology. The implementation of the methodology selected should
           be tested for quality assurance; and the party selecting the methodology must
           be prepared to explain the rationale for the method chosen to the court,
           demonstrate that it is appropriate for the task, and show that it was properly
           implemented.”
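
One way to act on the "appropriate sampling" language above is to draw a random sample from the documents a keyword search did NOT hit, have reviewers code it, and report the share that turns out to be responsive together with a confidence interval. The sketch below is illustrative only: the function names, the sample size of 400, and the choice of a Wilson score interval are assumptions made here, not anything prescribed by the opinion.

```python
import math
import random


def sample_documents(doc_ids, sample_size, seed=0):
    """Draw a simple random sample (without replacement) of document ids."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated and documented
    doc_ids = list(doc_ids)
    return rng.sample(doc_ids, min(sample_size, len(doc_ids)))


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 is roughly 95% confidence."""
    if n == 0:
        raise ValueError("sample must not be empty")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical numbers: 50,000 documents were not hit by the keyword list,
# 400 of them are sampled, and reviewers mark 12 of the sample responsive.
unretrieved = range(50_000)
sample = sample_documents(unretrieved, 400)
low, high = wilson_interval(hits=12, n=len(sample))
print(f"Estimated miss rate between {low:.1%} and {high:.1%}")
```

Recording the seed, the sample size, and the reviewers' calls is what makes the test repeatable if the method is later questioned.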




6/4/2012                                                                                      66
From Pre-Discovery to Production Completeness

       Henry v. Quicken Loans --> 26(f) consulting
           – Lawyers agreed to keyword lists and process
           – Ran own (unsanctioned) searches with expert
           – Told to live with bad results, and pay for it
       Qualcomm --> Smell Test; Dig Deeper
           – In-house counsel (Qualcomm) v. Outside Counsel (Day Casebeer)
           – Sanctions, Attorney-Client Privilege Problems
           – Associate found docs and was told they weren't relevant; found out the
             hard way that those and 230,000 other pages were relevant
       Judge Rader's Protocol in TX for Patent cases
           – 5 custodians
           – 5 search terms (can you say overbroad…)

6/4/2012                                                                          67
Under-inclusive - Missing in Action
           Missing abbreviations / acronyms / clippings:
            – incentive stock option but not ISO

            – Board of Directors but not BOD

            – 1998 plan but not 98 plan



           Missing inflectional variants:
            – grant but not grants, granted, granting



           Missing spellings or common misspellings:
            – gray but not grey

            – privileged but not
               priviliged, priviledged, privilidged, priveliged, privelidged,
               priveledged, …
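
The gaps listed on this slide are easy to reproduce. The sketch below is a toy illustration, assuming a simple whole-word, case-insensitive matcher (`find_hits`, a name invented here) and three made-up sentences. It shows a literal keyword list missing every document, and a hand-expanded list catching them.

```python
import re


def find_hits(terms, text):
    """Return the terms that occur in text as whole words or phrases, ignoring case."""
    hits = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(term)
    return hits


# Made-up sample sentences, one per kind of gap on the slide.
documents = [
    "The BOD approved the 98 plan.",               # abbreviation and clipping
    "She was granted an ISO last year.",           # inflection and acronym
    "This memo is priviledged and confidential.",  # common misspelling
]

narrow_terms = ["board of directors", "1998 plan", "incentive stock option",
                "grant", "privileged"]
expanded_terms = narrow_terms + ["BOD", "98 plan", "ISO", "grants", "granted",
                                 "granting", "priviledged"]

for doc in documents:
    print(find_hits(narrow_terms, doc), find_hits(expanded_terms, doc))
```

With the narrow list every document comes back empty; the expanded list finds at least one term in each.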
6/4/2012                                                                            68
Missing in Action II

           Missing syntactic variants:
           board of directors meeting
             but not
                  meeting of the board of directors   mtg of the directors
                  BOD meeting                         BOD meetings
                  board meeting                       board meetings
                  BOD mtg                             BOD mtgs
                  board mtg                           board mtgs
                  directors’ meeting                  directors’ meetings
                  directors’ mtg                      directors’ mtgs
                  mtg of the BOD                      mtgs of the BOD
                                                      mtgs of the directors
6/4/2012                                                         69
Missing in Action III

           Missing synonyms / paraphrases:

                 hire date but not start date

                 approved by Smith

              but not

                 Smith’s approval          the goahead from Smith
                 the approval of Smith     the nod from Smith
                 Smith’s ok                Smith’s signature
                 Smith’s go-ahead          Smith’s sign-off
                 Smith’s goahead           the sign-off of Smith
                 the go-ahead from Smith   the signoff of Smith
6/4/2012                                                           70
Missing in Action IV
           As a keyword item, the address
             101 E. Bergen Ave., Temple, CA 90200
             does not match any of:
             101 East Bergen Avenue

                 the Bergen site

                 the Temple location

                 our 90200 outlet




6/4/2012                                            71
Over-inclusive - Unwanted Extras
           Options

           Target: Sheila was granted 100,000 options at $10
            Match: What are our options for lunch?
            Match in a signature line:
                    Amanda Wacz
                    Acme Stock Options Administrator
           Destroy
            Target: destroy evidence
            Match in a disclaimer: The information in this email, and any
              attachments, may contain confidential and/or privileged
              information and is intended solely for the use of the named
              recipient(s). Any disclosure or dissemination in whatever form, by
              anyone other than the recipient is strictly prohibited. If you have
              received this transmission in error, please contact the sender
              and destroy this message and any attachments. Thank you.
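
One possible mitigation for the disclaimer problem is to remove known boilerplate before the keyword is applied. The sketch below assumes the disclaimer text is known in advance and appears verbatim; real mail often varies in whitespace and wording, so this is an illustration of the idea rather than a working filter. The names `strip_boilerplate` and `matches` are invented for the example.

```python
import re

# Last sentence of the disclaimer quoted on the slide, treated as known boilerplate.
DISCLAIMER = ("If you have received this transmission in error, please contact "
              "the sender and destroy this message and any attachments.")


def strip_boilerplate(text, boilerplate_blocks):
    """Remove each known boilerplate block (exact string match) from the text."""
    for block in boilerplate_blocks:
        text = text.replace(block, " ")
    return text


def matches(term, text):
    """Whole-word, case-insensitive keyword test."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


harmless = "Please review the attached grant schedule. " + DISCLAIMER
suspicious = "We should destroy the backup tapes. " + DISCLAIMER

for message in (harmless, suspicious):
    body = strip_boilerplate(message, [DISCLAIMER])
    print(matches("destroy", message), matches("destroy", body))
```

The first message hits only because of its disclaimer and drops out once the boilerplate is removed; the second still hits on its own text.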
6/4/2012                                                                            72
Unwanted Extras II
       alter*

       Target: alter, alters, altered, altering
  Matches:
       alternate, alternative, alternation, altercate, altercation, alterably, …


       grant

    Target: stock option grant
  Matches names: Grant Woods, Howard Grant
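
The `alter*` example can be shown with two regular expressions. This is a minimal sketch on a made-up sentence: a trailing wildcard is approximated with `\w*`, and the alternative is an explicit list of the intended inflections.

```python
import re

# A trailing wildcard, approximated here as "alter" followed by any word characters.
WILDCARD = re.compile(r"\balter\w*", re.IGNORECASE)

# Only the inflections the searcher actually meant.
EXPLICIT = re.compile(r"\b(?:alter|alters|altered|altering)\b", re.IGNORECASE)


def wildcard_hits(text):
    """Words matched by the wildcard form."""
    return WILDCARD.findall(text)


def explicit_hits(text):
    """Words matched by the explicit inflection list."""
    return EXPLICIT.findall(text)


text = ("We may alter the plan; an alternative is altering the schedule "
        "after the altercation.")
print(wildcard_hits(text))   # includes 'alternative' and 'altercation'
print(explicit_hits(text))   # only 'alter' and 'altering'
```

The explicit list trades a little setup effort for far fewer unwanted hits; it does nothing for the `grant` versus `Grant Woods` problem, which needs context rather than spelling.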
6/4/2012                                                              73
Tuning an Ontology

           Linguists briefed as reviewers
           Linguists read the data
           Linguists study complaint and other relevant
           documents
           Linguists analyze the search index
           Legal Team provides input, feedback




6/4/2012                                                  74
A Simple Linguistic Ontology


           ROYALTY CONCEPT
            –   Royalty
            –   Commission
            –   Honorarium
            –   Usage Fee
            –   Slice of the Pie




6/4/2012                                  75
A Simple Pricing Concept

           PRICING CONCEPT
           –   Purchase Order
           –   PO
           –   Dollar amount
           –   Invoice




6/4/2012                               76
Adding Subjective Content


           PRICING CONCEPT
           –   Purchase Order
           –   PO
           –   Dollar amount
           –   Invoice
           –   Cylinder
           –   Canister
           –   Bottle
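
A concept of this kind can be represented as a plain mapping from concept name to terms. The sketch below is an assumption about one simple way to store and apply it, not a description of any particular product; "Dollar amount" is left out because it is a pattern rather than a literal term.

```python
import re

# Concept name -> literal terms, taken from the two example slides.
ONTOLOGY = {
    "ROYALTY": ["royalty", "commission", "honorarium", "usage fee",
                "slice of the pie"],
    "PRICING": ["purchase order", "PO", "invoice",
                "cylinder", "canister", "bottle"],  # last three are case-specific additions
}


def compile_concept(terms):
    """Build one whole-word, case-insensitive pattern for all terms of a concept."""
    ordered = sorted(terms, key=len, reverse=True)  # try longer phrases first
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def concepts_in(text, ontology):
    """Return the sorted names of all concepts with at least one term in the text."""
    return sorted(name for name, terms in ontology.items()
                  if compile_concept(terms).search(text))


print(concepts_in("Please send the PO for the bottle order.", ONTOLOGY))
print(concepts_in("He wants a slice of the pie.", ONTOLOGY))
```

Adding "Cylinder", "Canister" and "Bottle" to the pricing concept is then a one-line change to the data, which is what makes case-by-case tuning practical.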



6/4/2012                               77
Ontology Usage
           Identifying Misspellings, Slang, Nicknames, etc.
           Variant Generation – help the user find what he
           meant (names, words, suggestions)
           – Buy* Buying, Buys, Bought, etc.
           – Kenneth Lay, Ken Lay, klay, kenneth.lay
           View variations in context to choose topics
           Document segmentation – text blocks, signatures
           Finding Words in Context, Frequency
           at serious risk of losing        25
           are certain risks inherent in    16




6/4/2012                                                      78
Identifying misspellings, slang, etc.


  1.       Match the index against electronic dictionary.
  2.       From the remaining material (not in dictionary), remove any
           items that are merely numbers.
  3.       Find (in the ontologies) any words that are similar to what
           remains.
  4.       Add the similar words to the ontology



           This increases coverage (i.e., ensures
           that we retrieve documents that
           otherwise would have been missed)
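
The four steps above can be sketched with Python's standard `difflib` module. The similarity measure, the 0.8 cutoff, and the tiny word lists are assumptions chosen for illustration; the slide does not say how "similar" is computed.

```python
import difflib


def propose_ontology_additions(index_terms, dictionary, ontology_terms, cutoff=0.8):
    """Map each ontology term to unknown index terms that look like variants of it."""
    # Step 1: keep only index terms that are not in the dictionary.
    unknown = [t for t in index_terms if t.lower() not in dictionary]
    # Step 2: drop items that are merely numbers (allowing "," and "." separators).
    unknown = [t for t in unknown
               if not t.replace(",", "").replace(".", "").isdigit()]
    additions = {}
    for term in unknown:
        # Step 3: find the most similar ontology term, if any is close enough.
        close = difflib.get_close_matches(term.lower(), ontology_terms,
                                          n=1, cutoff=cutoff)
        if close:
            # Step 4: record the variant as a candidate addition.
            additions.setdefault(close[0], []).append(term)
    return additions


index_terms = ["privileged", "priviledged", "priviliged", "2012", "zebra", "xqzt"]
dictionary = {"privileged", "zebra"}
ontology_terms = ["privileged", "royalty"]
print(propose_ontology_additions(index_terms, dictionary, ontology_terms))
```

In practice a person would still review the proposed additions before they go into the ontology.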
6/4/2012                                                                 79
Variant Generation

           Help the user
           search for what he meant



                                       Take
                                       names, numbers, and
                                       other entities for which
                                       the user wants to search
                                       Automatically generate
                                       likely synonyms



6/4/2012                                                          80
Variant Generation
  Show the context of these variations, so the user can evaluate
    them.
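
For names, variant generation might look like the sketch below. The nickname table and the email-style patterns (first initial plus surname, first.last) are assumptions picked to reproduce the Kenneth Lay example from the earlier slide; a real tool would draw on much larger tables and on the aliases seen in the data.

```python
def name_variants(first, last, nicknames=None):
    """Generate display-name and login-style variants for a person's name."""
    nicknames = nicknames or {}
    first_names = [first] + nicknames.get(first, [])
    variants = []
    for f in first_names:
        variants.append(f"{f} {last}")    # "Ken Lay"
        variants.append(f"{last}, {f}")   # "Lay, Ken"
    f0, l0 = first.lower(), last.lower()
    variants.append(f"{f0[0]}{l0}")       # "klay"
    variants.append(f"{f0}.{l0}")         # "kenneth.lay"
    return variants


# Hypothetical nickname table with a single entry.
NICKNAMES = {"Kenneth": ["Ken"]}
print(name_variants("Kenneth", "Lay", NICKNAMES))
```

Each generated variant would then be shown with its surrounding text so the user can accept or reject it.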




6/4/2012                                                           81
Document Segmentation
                    Examples of signatures
  Jean-Louis Koenig
  President GGDA Region
  MegaCorp International SA
  Rue de Concours 2280
  Bern, Switzerland

  Robert Guilliam
  Product Regulatory Affairs & Compliance
  MegaCorp International
  Neuchatel
  Switzerland
  Tél. +41 (31) 125 2366

  Alberto Goreman
  Manager Printing & Packaging, Eastern Region
  +57 3 451 7195, alberto_goreman@megacorp.com

6/4/2012                                                            82
Finding words in context
             Phrase                                  Total Instances
             risks alienating some                   37
             at serious risk of losing               25
             are certain risks inherent in           16
             are at risk of running                  15
             it be risking anything by               15
             difference a risk o why                 14
             and the risks inherent in               12
             without assuming any risk                8
             we could risk losing next                7
             avoid transferring risk to the           5
             requires taking risks and the            4
             can t risk not living                    3
             and unknown risks and uncertainties      2
             a potential risk that was                2
             avoid transfering risk to the            2

           This increases coverage AND precision
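
A table like this can be produced by counting the short word windows around every occurrence of a stem. The sketch below assumes a two-words-before, two-words-after window and a simple prefix test for the stem; both are illustrative choices, and the sentences are made up.

```python
import re
from collections import Counter


def phrases_in_context(texts, stem, before=2, after=2):
    """Count the word windows surrounding each word that starts with `stem`."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        for i, word in enumerate(words):
            if word.startswith(stem):
                start = max(0, i - before)
                counts[" ".join(words[start:i + after + 1])] += 1
    return counts


texts = [
    "We are at serious risk of losing the account.",
    "They are at serious risk of losing it.",
    "There are certain risks inherent in this.",
]
for phrase, total in phrases_in_context(texts, "risk").most_common():
    print(f"{phrase:35}{total}")
```

Reading the phrases rather than the bare word lets the team pick the contexts that matter and ignore the rest.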
6/4/2012                                                               83
Multi-Lingual Issues
       Does language matter?
           – Lucerne
           – Luzerne
           – Lucerna
           These names all refer to the same city
       Name of city not necessarily expressed in the same
       language as rest of document
       In Europe, many email threads and documents are
       mixed language, and must be properly categorized as
       such


6/4/2012                                                     84
Automated Ontology Expansion Tools
           Currently implemented expansion modules:
              Spelling variants:
              color>>colour, defense>>defence, labeled>>labelled
              Lemmatization (recovering uninflected form):
              walking>>walk, ate>>eat
              Morphological variants:
              eat>>eats, eating, eaten, ate
              hablar>>hablo, hablas, habla, hablan, habláis, hablamos
              Number expansion:
              $2.5B>>two point five billion dollars
              2,567>>two thousand five hundred sixty seven
              13>>13th, thirteenth
              Name variants:
              Elizabeth Van der Beek>>“Liz Van der Beek”, “Liz Vander Beek”, “Van der
              Beek, Elizabeth”, “Beth Vanderbeek”, etc.
              Email variants (mined from alias clusters file):
              Elizabeth Van der Beek>>evanderbeek, liz.vanderbeek, vanderbeekl, emvanderbeek, etc.
              Abbreviations:
              administrative project meeting>>admin project meeting, admin project
              mtg, admin proj mtg, etc.
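
As one illustration of an expansion module, the abbreviation case can be sketched by substituting each word's known short forms in every combination. The abbreviation table and the function name are assumptions for the example, not the implementation behind the slide.

```python
from itertools import product

# Hypothetical abbreviation table: full word -> accepted short forms.
ABBREVIATIONS = {
    "administrative": ["admin"],
    "project": ["proj"],
    "meeting": ["mtg"],
}


def abbreviation_variants(phrase, abbreviations):
    """Return every variant of the phrase with one or more words abbreviated."""
    options = [[word] + abbreviations.get(word, [])
               for word in phrase.lower().split()]
    combos = [" ".join(combo) for combo in product(*options)]
    return combos[1:]  # the first combination is the unabbreviated phrase itself


for variant in abbreviation_variants("administrative project meeting", ABBREVIATIONS):
    print(variant)
```

Three words with one short form each yield seven variants, which is why this kind of expansion is better generated than typed by hand.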

6/4/2012                                                                                85

Georgetown Law Guest Lecture 2012 6 2

  • 1. “Triggers,” Preservation & Search June 2, 2012 Georgetown Law Sonya L. Sigler 6/4/2012 1
  • 2. Overview Triggers & Preservation • What is it? • Why Does it Matter? Search Keyword Search Clustering Ontologies Technology Enhanced Review - Sampling Social Networking Analysis Relationship Analysis 6/4/2012 2
  • 3. “Triggers” & Preservation What is a Trigger? – Litigation reasonably anticipated – Who decides Litigation Hold Continuum – Established in hindsight – Threat – Letter about litigation – Filing Suit Cases – Pippins, Zubulake, Pension Committee
  • 4. Pippins v. KPMG How much data to Preserve? – All hard drives (Pippins’ position) – 100 Sample Hard drives (KPMG’s position) To Cooperate or NOT to Cooperate? How Judges React to Lack of Cooperation
  • 5. Zubulake Litigation Holds – Cannot send a request into the ether Preservation Have to follow up Take affirmative steps to monitor compliance In-house Counsel Duty Cannot leave it to employees’ discretion Document what was done
  • 6. Pension Committee No intentional destruction of data Careless & indifferent No Latchkey Custodians (alone & unsupervised) – Identify Custodians – Monitor their efforts – Including former employees and third parties Proactive Consistent Reasonable Approach 6/4/2012 6
  • 7. Triggers When does a duty to preserve arise? 6/4/2012 7
  • 8. What To Do? Who to include? – Not about data volume – Not about contact with underlying “litigation” Key Players (Zubulake opinions) – Likely to have relevant information – CEO, Board, Committees, employees, etc. Produce it from the Key Player (not others) – Nursing Home Pension Fund v. Oracle – Produce emails from the CEO (15) not others (1,650) 6/4/2012 8
  • 9. Spoliation Failure to Preserve – Didn’t Ask • Right person • Right Place – Didn’t follow up Destruction of Data – Intentional – Inadvertent destruction What can happen – Sanctions – Adverse Inferences
  • 10. Search How to Use it To Find Information How to Use it to Ignore Information When to use which search methodology 6/4/2012 10
  • 11. Search - Data Assessment Where is the Data? – Data Mapping - databases, servers, desktops, laptops, IMs, smart phones, voicemail, other records Defining Process from Collection to Review to Production Collection Strategy, Process, Approach – Scope of collection: custodians, date ranges, topics Reports on the Data Processing – File types, encrypted files, de-duplication rates, password protected files, etc. Not Reasonably Accessible data Assessing Risk of Data Loss
  • 12. Search - Case Assessment Who - Cast of Characters What - What the Heck Happened? Where - Where did it take place? When - What time period are we concerned with? How - fraud, antitrust violation, etc. WHY - What were the motives involved? Data Assessment ≠ Effective Case Assessment 6/4/2012 12
  • 13. Keyword Search Under Scrutiny United States v. O’Keefe (Facciola) – Questioned lawyers’ ability to decide which search terms are more likely to produce relevant information – Facciola has also suggested that litigants take a look at advanced search methodologies Victor Stanley, Inc. v. Creative Pipe, Inc. (Grimm) – Defensibility of process AND execution lies with the party relying upon the search protocol to meet their obligations, which needs to be able to explain search rationale, appropriateness, and proper implementation – Advocates quality assurance, e.g. by sampling – Searches should be designed by a competent practitioner
  • 14. Keyword Specific Case William A. Gross Construction Associates, Inc. v. American Manufacturers Mutual Insurance Company SDNY, Judge Andrew Peck Keyword list was in the thousands Use the actual data set and custodians to figure out keywords “This case is just the latest example of lawyers designing keyword searches in the dark, by the seat of the pants, without adequate (indeed, here, apparently without any) discussion with those who wrote the emails. Prior decisions from Magistrate Judges in the Baltimore-Washington Beltway have warned counsel of this problem, but the message has not gotten through to the Bar in this District.”
  • 15. $6M Keyword Mistake In re Fannie Mae Securities Litigation 3rd Party - OFHEO DC Circuit - Judge David Tatel Attorney agreed to something he did NOT understand Long list of key terms Taxpayers suffered the consequence 6/4/2012 15
  • 16. What This Means • The Courts are finally catching up • Courts actively ruling on Standards of Care and Process • Lawyers are Getting Wise 6/4/2012 16
  • 17. Case Law Effects on Discovery Defensibility of Review Process is now a focus – Culling now can kill you later – Cooperation is a hot topic – Tussle between inside & outside counsel – Beginning to see planning as a necessity Increased focus on Quality – Heightened involvement expected from corporate clients in the overall process – Cases pushing this, Qualcomm, Creative Pipe 6/4/2012 17
  • 18. What Else Is There? Effort to establish & codify uniform “Best Practices” – Quickly becoming roadmap for uneducated industry – Increasingly relied upon by judges as measure of reasonable or standard behavior Publications have addressed: – Document retention & production – Email management – Search & Retrieval – Protective orders & confidentiality – ESI admissibility 6/4/2012 18
  • 19. Getting to a Manageable Review Set Focus on finding, reviewing & using the “right” data, not just filtering data [Chart labels: Intake Data 100%; Duplicates 25%; Junk/Spam/Porn 20%; NR/Priv 20%; Non-Responsive 20%; Responsive & Priv 15%; Produced 12.25%] These figures vary based upon the data set received
  • 20. Search Methodologies [Diagram, with side labels Visualization and Measurement] Context: Relationship Analysis (documents with causal or sequential relationship); Social Network Analysis (relationships among relevant people). Concept: Clustering (similarity of salient features); Ontology (generalized words or phrases). Content: Keyword (specific exact words, proximity searches, stemming)
  • 21. Keyword Accuracy Example Keyword search reduced the document set by only 47% And 88% of the documents returned by keyword search were not responsive (Over-inclusive) 8,553 responsive documents missed by keyword search (Almost 8% of responsive documents missed by keyword search - Under-inclusive) 6/4/2012 21
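The percentages on this slide map onto the usual retrieval measures of precision and recall. The following is a minimal sketch of that arithmetic, assuming the slide's "almost 8%" is treated as exactly 8%; the function names are illustrative and the derived total is an estimate, not a figure from the deck.

```python
# Figures quoted on the slide.
MISSED_RESPONSIVE = 8_553     # responsive documents the keyword search missed
MISSED_SHARE = 0.08           # "almost 8%" of responsive documents (assumed exact)
NON_RESPONSIVE_SHARE = 0.88   # share of keyword hits that were not responsive


def precision(non_responsive_share_of_hits):
    """Fraction of returned documents that were actually responsive."""
    return 1.0 - non_responsive_share_of_hits


def recall(missed_share_of_responsive):
    """Fraction of all responsive documents that the search returned."""
    return 1.0 - missed_share_of_responsive


def estimated_total_responsive(missed, missed_share_of_responsive):
    """Back out the total responsive population from the missed count."""
    return missed / missed_share_of_responsive


print(f"precision: {precision(NON_RESPONSIVE_SHARE):.0%}")
print(f"recall:    {recall(MISSED_SHARE):.0%}")
print(f"estimated responsive documents: "
      f"{estimated_total_responsive(MISSED_RESPONSIVE, MISSED_SHARE):,.0f}")
```

Under these assumptions, roughly one in eight documents sent to review is responsive, which is the cost side of an over-inclusive search.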
  • 22. Myth Keyword Searching is the Way to Go If I agree to keyword terms, I am OK Keyword Search Cases Keyword replacement example Keyword substitution Missing in Action (Under-inclusive) Unwanted Extras (Over-inclusive) Multiple subject/persons (Disambiguate)
  • 23. Fact or Myth? Manual review by humans of large amounts of information is as accurate and complete as possible - perhaps even perfect - and constitutes the gold standard by which all searches should be measured This is “The reigning Myth of ‘perfect’ retrieval using traditional means” Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery The Sedona Conference Journal (2007) p. 199 Human beings retrieved less than 20% of the relevant documents when they believed they were retrieving over 75% An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System Blair & Maron (1985)
  • 24. Blair and Maron 1985 A classic study of retrieval effectiveness – earlier studies were on unrealistically small collections Studied an archive of documents for a legal suit – ~350,000 pages of text – 40 queries – focus on high recall – Used IBM’s STAIRS full-text system Main Result: – The system retrieved less than 20% of the relevant documents for a particular information need; lawyers thought they had 75% But many queries had very high precision
  • 25. Blair and Maron, cont. How they estimated recall – generated partially random samples of unseen documents – had users (unaware these were random) judge them for relevance Other results: – two lawyers’ searches had similar performance – lawyers’ recall was not much different from paralegals’
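The sampling idea on this slide can be sketched as simple arithmetic: judge a random sample of the documents the search did not return, scale the relevant share up to the whole unretrieved set, and compare it with what was found. This is a simplified illustration rather than the study's actual procedure, and every number in the example call is hypothetical.

```python
def estimate_recall(relevant_retrieved, unretrieved_total,
                    sample_size, relevant_in_sample):
    """Estimate recall from a random sample of unretrieved documents.

    relevant_retrieved: relevant documents the search returned
    unretrieved_total:  documents the search did not return
    sample_size:        how many unretrieved documents were sampled and judged
    relevant_in_sample: how many sampled documents were judged relevant
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    # Scale the sample's relevant count up to the full unretrieved population.
    estimated_missed = unretrieved_total * relevant_in_sample / sample_size
    total_relevant = relevant_retrieved + estimated_missed
    return relevant_retrieved / total_relevant if total_relevant else 0.0


# Hypothetical numbers: 100 relevant documents found, 10,000 documents left
# unretrieved, and 20 relevant documents in a random sample of 500 of them.
print(f"estimated recall: {estimate_recall(100, 10_000, 500, 20):.0%}")
```

The point of the design is that searchers cannot count what they never saw, so the missed documents have to be estimated from a sample rather than from the search results themselves.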
  • 26. Blair and Maron, cont. Why recall was low – users can’t foresee exact words and phrases that will indicate relevant documents • “accident” referred to by those responsible as: “event,” “incident,” “situation,” “problem,” … • differing technical terminology • slang, misspellings – Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied
  • 27. Keyword Search Summary Pro: Word Stemming (Hous* - house, housemate, household); Easy to use/explain/agree; Familiar; Fast results. Con: Over-inclusive (Disambiguate); Under-inclusive (Word must be present); Hard to craft; Ineffective with short messages, IMs
  • 28. Keyword Truths Under-inclusive - missing relevant or important info Over-inclusive - costly to review “Reasonable Keyword Search” doesn’t exist Effective keyword search is difficult/impossible – Index Data, Analyze Index – Suggest keywords or approach Keywords may not be appropriate for the data Keyword Search is ONE Tool in Your Arsenal
  • 29. Keyword Accuracy Example (repeats slide 21)
  • 30. Search Methodology Continuum Review Methodology - Decided Upfront Identify Issues in the Case – Formulate Queries and Approaches for Finding Responsive Documents – Formulate Relevancy and Responsiveness Guidelines Identify Primary Participants Select or Triage Documents for Review 6/4/2012 30
  • 31. Review Tools for Relevancy Assessment Keyword Searches, Culling – Slices of Data are Reviewed Categorization of Data – Entire Dataset is Categorized – Review Targeted Data Automated Review – Categorization of Dataset – Random Sampling (Statistically Significant) 6/4/2012 31
  • 32. Categorization of Data for Review Categorize Entire Data Set – Spam/Porn/System Files – Personal/Private Data – Non-relevant Business Data Business Data – Relevancy Assessment by Topic – Privilege Review Keyword, Topic Analysis - Overlap, Holes 6/4/2012 32
  • 33. Search Methodologies (same diagram as slide 20)
  • 34. Categorization Methods Statistical Methods (#s based) – Topic Clustering • Statistical Similarity • Counting #s of words, appearance together – Latent Semantic Indexing – Supervised v. Unsupervised Clustering Linguistic Methods (Word Based) – Keyword (Culling Method) – Ontologies 6/4/2012 34
  • 35. Clustering Clustering just means putting documents into groups that have something in common. Manually (that's what manual review is) Keyword Searches Ontologies (linguistic filters) Automated clustering (using technology) – Automated clustering by document type (all the Word documents go into one basket) – Automated clustering by creation date – Automated clustering by Actor – Automated clustering by statistical similarity (statistical clustering) – ... and many other approaches
  • 36. Clustering -- “Options” 1 Cluster or 4 Clusters Financial/energy trading options Email/computer menu-driven options Stock options (ISO's) The generic idea of an available choice of action 6/4/2012 36
  • 37. Clustering Software implements statistical methods of finding groups of “similar” documents – “Similar” must be defined appropriately for the application Documents are categorized with very little effort by the user May help with document review – A single reviewer can look at similar documents together, produce consistent review decisions – Tight clustering can be used to detect “near duplicates” caused by OCR errors 6/4/2012 37
  • 38. Clustering vs. queries Clustering is unpredictable compared to keywords or taxonomies The items that look very similar (to the clustering algorithm) may not actually be similar in ways that matter – Relevancy may depend upon fine legal distinctions – May vary in the same matter by subpoena and/or jurisdiction 6/4/2012 38
  • 39. Ontologies Implement ontologies for directed searches. – Approach searching from a knowledge-representation viewpoint – Field is 25 years old, lots of work done – Advantages: • Disambiguate different meanings of the same word from their context → More accurate • Encapsulate many ways of saying the same thing → More thorough • Search for concepts, not individual words → More intuitive, more reusable, and faster Can be combined with other methods (unsupervised clustering, discussions).
  • 40. Subjectivity GOOD WEATHER – Sun – Calm BAD WEATHER – Rain – Snow – Wind 6/4/2012 40
  • 41. A More Realistic Ontology ROYALTY CONCEPT • royalty • royalties • rty • commission • commissions • comm. • honorarium • honorariums • honoraria • “take cut”~2 • “takes cut”~2 • “took cut”~2 • “slice pie”~5 • “piece pie”~5 • “piece action”~5 • “slice action”~5 • -king • -queen • -prince • -princess • charge for use • charged for use • charging for use • charges for use • licence fee • license fee • lisense fee • usage fee • usage charge • usg fee • use fee • fee for use • fee for usage • incent* • insent* • earn a fee • eam a fee
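Entries such as “take cut”~2 read like proximity terms. The sketch below assumes the notation means the two words occur with at most N other words between them, in either order; that reading, the function name `within`, and the sample sentences are assumptions for illustration and do not reproduce any particular search engine's exact semantics.

```python
import re


def within(text, first, second, max_gap):
    """Return True if `first` and `second` occur with at most `max_gap`
    other words between them, in either order."""
    words = re.findall(r"[a-z']+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    return any(
        abs(i - j) - 1 <= max_gap
        for i in first_positions
        for j in second_positions
        if i != j
    )


# "take ... cut" with two intervening words matches at a gap of 2.
print(within("He will take a big cut of every sale", "take", "cut", 2))
# The same words far apart do not match.
print(within("Take the time to review every proposed cut", "take", "cut", 2))
```

A proximity term catches phrasings like "take a cut" or "take his cut" that an exact-phrase keyword would miss, while staying narrower than searching for either word alone.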
  • 42. Ontology as a Query But it can be slightly cumbersome to deal with directly in that form q ((+(std:%CapacityReports_% std:%DINCapacity_%) +(std:%ACMEEPPlant_% std:%ProductName_%)) (+(std:%ACMEPNPlant_% std:%ProductName_%) +(std:%ProductiveCapability_% std:%CapacityReports_%)) (+(std:%CapacityCreep_% std:%OperationsImprovement_% std:%CapacityExpansion_% std:%CapacityRestoration_%) +(std:%ACMEPNPlant_% std:%ProductName_%)) (+(std:%EquipmentReplacement_% std:%FinishingColumn_%) +(std:%ACMEPNPlant_% std:%ProductName_%)) (std:%Audit_% actor:%Audit_%) (+(std:%SettlementNegotiations_% std:%ContractNegotiations_% ) +(actor:%ACMEOutsideCounsel_% std:%ACMEOutsideCounsel_% actor:%ACME UBOutsideCounsel_% std:%AcmeSubOutsideCounsel_% actor:%AcmeSub_% std:%AcmeSub_%)) (std:%FTC_% actor:%FTC_%) ((+subject:%ProductName_% +(std:swap std:"supply agreement" std:"exchange agreement" std:"agree to exchange")) std:"name (About a quarter of its regular size) 6/4/2012 42
  • 43. Ontology Pros & Cons Pros: Identify acronyms; Normalize variants; Disambiguate terms; Identify overly broad keywords; Identify and correct keywords with errors; Create extensive libraries of ontologies; Can be used as a clustering method; Topics can appear in more than one language; Reusable for different types of litigation, e.g. anti-trust, product liability etc. (and for both offense and defense). Cons: As with Keyword - word based; Labor intensive, upfront
  • 44. “Search” Terminology Technology-Enhanced Review; Technology Assisted Review; Automated Review; Predictive Coding [Diagram labels: People (Privilege, Subject Matter, Review); Process (Workflow, Production, Feedback); Technology; Quality Control]
  • 45. [Flowchart] Setup → Sample → Expert judges sample (Responsive / Non-responsive) → Model learns → Model predicts → Repeat as needed → Model categorizes all remaining documents (Responsive / Non-responsive)
  • 47. Technology Enhanced Review: Speed, Predictable Costs, and Accuracy Automate any portion of the review Example from a real case [Chart labels: Source Data 100%; Eliminate Duplicates & System Files 30%; Non-Responsive Isolation (ontologies) 30%; NR by Technology Enhanced Review (removed another 18%); Responsive by Technology Enhanced Review (removed another 7%); Priv; High-Speed Manual Review; remaining figures 22%, 3%, 15%]
  • 48. Search Methodologies (same diagram as slide 20)
  • 49. From Document Analysis to Social Network Analysis 6/4/2012 49
  • 50. From Social Network Analysis to Discussions 6/4/2012 50
  • 51. Search Methodologies (same diagram as slide 20)
  • 52. Analytics are Based on the Model and on Discussions Analytics 6/4/2012 52
  • 53. Better Answers and Better Questions When were customary work practices circumvented? When did established norms of behavior change? Who knew, or likely knew, what facts? Who interacted with whom and how intimately? Who was involved in what types of decisions or meetings? Who are the real ‘insiders’? What data is hidden or missing? When were electronically documented conversations “taken off line,” possibly in an attempt to avoid detection? How did the importance of different actors change over time?
  • 54. Bear Stearns Lower Bar For Fraud? Two hedge fund managers arrested Charged with securities and wire fraud, and one with insider trading Internal emails: – “I'm fearful of these markets. ... As we discussed it may not be a meltdown for the general economy but in our world it will be.” – “I think we should close the funds now .” External communications: – “We are very comfortable with exactly where we are.” – “The funds are performing exactly as they were designed to.” 6/4/2012 54
  • 56. Analysis of Anomalous Communication Patterns Unusual levels relative to a particular type of activity pop out Color-coded graphs show relative communication densities for apples to apples comparisons 6/4/2012 56
  • 58. Emotive Tone Whistle-blower Scenario 6/4/2012 58
  • 59. “Call Me” Events Sequence Viewer used for analytics-driven review 6/4/2012 59
  • 60. Search Risks Failure to find responsive documents Failure to recognize responsive documents Failure to recognize privileged documents Inconsistent treatment of documents (e.g., duplicates) Failure to complete project in a timely manner Sophisticated Tools – Understand What They Do and Don’t Do Well – Inform Yourself, Speak to References, Consultants
  • 61. Transparency of Process Discussing Review Protocols – Provide transparent, defensible, sophisticated search based on document content – Clustering, Ontologies, Analytics, and yes, sometimes Keywords too Develop search methodologies for each case – Use technology experts in consultation with case / legal experts Results verifiable by Quality Control – Defensible sampling 6/4/2012 61
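“Defensible sampling” usually starts with deciding how many documents to sample. A minimal sketch, assuming the standard sample-size formula for estimating a proportion, n = z²·p(1−p)/e², with the worst-case p = 0.5 and no finite-population correction; the function name and defaults are illustrative and are not taken from the deck.

```python
import math


def sample_size(margin_of_error, z_score=1.96, expected_rate=0.5):
    """Documents to sample so that an estimated rate (for example, the share
    of responsive documents left in a discard pile) falls within
    +/- margin_of_error at the confidence level implied by z_score."""
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be between 0 and 1")
    n = (z_score ** 2) * expected_rate * (1 - expected_rate) / margin_of_error ** 2
    return math.ceil(n)


# 95% confidence (z = 1.96) with a +/- 5% margin of error.
print(sample_size(0.05))
```

Tightening the margin of error raises the sample size quadratically, which is why the acceptable margin is worth agreeing on before the review protocol is fixed.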
  • 62. Thank you! Sonya L. Sigler Vice President, Product Strategy SFL Data 415-321-8385 sonya@sfldata.com www.sfldata.com 6/4/2012 62
  • 63. Review Protocol ≠ Agreeing to Search Terms Data Culling (upfront or backend) Search Methodologies - Continuum – Keyword Positive List – Ontologies – Clustering – Technology Enhanced Review – Relationship Analysis Quality Control Process & Procedures Privilege Review, Sensitivities Production Format & Timing 6/4/2012 63
  • 64. Search The Courts are Finally Starting to Catch up to Technology Making more aggressive rulings: – Forcing attorneys to live with the results of bad searches – Sanctioning those who screw up, even if no allegation of fraud – Demanding repeatable, demonstrable process – using terms like “quality assurance” 6/4/2012 64
  • 65. Search Under Scrutiny Facciola’s Opinions - United States v. O’Keefe “for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than [other] search terms … is truly to go where angels fear to tread.” He has also suggested that litigants take a good look at more advanced search methodologies, including the use of computational linguistics and technology assisted review 6/4/2012 65
  • 66. Reasonableness of Search Methods Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md., May 29, 2008). "Common sense suggests that even a properly designed and executed keyword search may prove to be over-inclusive or under-inclusive...the only prudent way to test the reliability of the keyword search is to perform some appropriate sampling." “Selection of the appropriate search and information retrieval technique requires careful advance planning by persons qualified to design effective search methodology. The implementation of the methodology selected should be tested for quality assurance; and the party selecting the methodology must be prepared to explain the rationale for the method chosen to the court, demonstrate that it is appropriate for the task, and show that it was properly implemented.” 6/4/2012 66
  • 67. From Pre-Discovery to Production Completeness Henry v. Quicken Loans --> 26(f) consulting – Lawyers agreed to keyword lists and process – Ran own (unsanctioned) searches with expert – Told to live with bad results, and pay for it Qualcomm --> Smell Test; Dig Deeper – In-house counsel (Qualcomm) v. Outside Counsel (Day Casebeer) – Sanctions, Attorney-Client Privilege Problems – Associate found docs and was told they weren’t relevant; found out the hard way that those and 230,000 other pages were relevant Judge Rader’s Protocol in TX for Patent cases – 5 custodians – 5 search terms (can you say over broad…)
  • 68. Under-inclusive - Missing in Action Missing abbreviations / acronyms / clippings: – incentive stock option but not ISO – Board of Directors but not BOD – 1998 plan but not 98 plan Missing inflectional variants: – grant but not grants, granted, granting Missing spellings or common misspellings: – gray but not grey – privileged but not priviliged, priviledged, privilidged, priveliged, privelidged, priveledged, …
  • 69. Missing in Action II Missing syntactic variants: board of directors meeting but not: meeting of the board of directors; mtg of the directors; BOD meetings; BOD meeting; board meetings; board meeting; BOD mtgs; BOD mtg; board mtgs; board mtg; directors’ meetings; directors’ meeting; directors’ mtgs; directors’ mtg; mtgs of the BOD; mtg of the BOD; mtgs of the directors
  • 70. Missing in Action III Missing synonyms / paraphrases: hire date but not start date; approved by Smith but not: Smith’s approval; the approval of Smith; Smith’s ok; Smith’s go-ahead; Smith’s goahead; the go-ahead from Smith; the goahead from Smith; the nod from Smith; Smith’s signature; Smith’s sign-off; the sign-off of Smith; the signoff of Smith
  • 71. Missing in Action IV As a keyword item, the address 101 E. Bergen Ave., Temple, CA 90200 does not match any of: 101 East Bergen Avenue the Bergen site the Temple location our 90200 outlet 6/4/2012 71
  • 72. Over-inclusive - Unwanted Extras Options Target: Sheila was granted 100,000 options at $10 Match: What are our options for lunch? Match in a signature line: Amanda Wacz Acme Stock Options Administrator Destroy Target: destroy evidence Match in a disclaimer: The information in this email, and any attachments, may contain confidential and/or privileged information and is intended solely for the use of the named recipient(s). Any disclosure or dissemination in whatever form, by anyone other than the recipient is strictly prohibited. If you have received this transmission in error, please contact the sender and destroy this message and any attachments. Thank you.
  • 73. Unwanted Extras II alter* Target: alter, alters, altered, altering Matches: alternate, alternative, alternation, altercate, altercation, alterably, … grant Target: stock option grant Matches names: Grant Woods, Howard Grant
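The alter* example can be reproduced with two regular expressions: a prefix wildcard, and an explicit list of the inflections actually wanted. This is a small sketch using Python's standard `re` module; the sample sentence is invented for illustration.

```python
import re

# Prefix wildcard, equivalent in spirit to the keyword term alter*.
WILDCARD = re.compile(r"\balter\w*", re.IGNORECASE)

# Explicit list of the intended inflections only.
TARGETED = re.compile(r"\b(?:alter|alters|altered|altering)\b", re.IGNORECASE)

sample = ("She altered the ledger, then proposed an alternative "
          "after the altercation.")

print("wildcard hits:", WILDCARD.findall(sample))
print("targeted hits:", TARGETED.findall(sample))
```

The wildcard is quicker to write and agree on, but it pulls in every word that shares the prefix; the explicit list trades that breadth for fewer unwanted extras.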
  • 74. Tuning an Ontology Linguists briefed as reviewers Linguists read the data Linguists study complaint and other relevant documents Linguists analyze the search index Legal Team provides input, feedback 6/4/2012 74
  • 75. A Simple Linguistic Ontology ROYALTY CONCEPT – Royalty – Commission – Honorarium – Usage Fee – Slice of the Pie 6/4/2012 75
  • 76. A Simple Pricing Concept PRICING CONCEPT – Purchase Order – PO – Dollar amount – Invoice 6/4/2012 76
  • 77. Adding Subjective Content PRICING CONCEPT – Purchase Order – PO – Dollar amount – Invoice – Cylinder – Canister – Bottle 6/4/2012 77
  • 78. Ontology Usage Identifying Misspellings, Slang, Nicknames, etc. Variant Generation – help the user find what he meant (names, words, suggestions) – Buy* Buying, Buys, Bought, etc. – Kenneth Lay, Ken Lay, klay, kenneth.lay View variations in context to choose topics Document segmentation – text blocks, signatures Finding Words in Context, Frequency at serious risk of losing 25 are certain risks inherent in 16 6/4/2012 78
  • 79. Identifying misspellings, slang, etc 1. Match the index against electronic dictionary. 2. From the remaining material (not in dictionary), remove any items that are merely numbers. 3. Find (in the ontologies) any words that are similar to what remains. 4. Add the similar words to the ontology This increases coverage (i.e., ensures that we retrieve documents that otherwise would have been missed) 6/4/2012 79
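The four steps above can be sketched with Python's standard `difflib` module. The similarity measure, the 0.8 cutoff, the function name, and the sample terms are all assumptions made for illustration; the deck does not say how similarity was computed.

```python
from difflib import get_close_matches


def find_ontology_additions(index_terms, dictionary, ontology, cutoff=0.8):
    """Return index terms that look like misspellings of ontology words."""
    # Step 1: keep only index terms that are not in the dictionary.
    unknown = [t.lower() for t in index_terms if t.lower() not in dictionary]
    # Step 2: drop items that are merely numbers.
    unknown = [t for t in unknown if not t.replace(",", "").replace(".", "").isdigit()]
    # Step 3: keep the remaining terms that are similar to an ontology word.
    additions = {
        term for term in unknown
        if get_close_matches(term, ontology, n=1, cutoff=cutoff)
    }
    # Step 4: the caller adds these terms to the ontology.
    return sorted(additions)


index_terms = ["privileged", "priviledged", "2004", "lunch", "lisense", "xylophone"]
dictionary = {"privileged", "lunch", "license", "xylophone"}
ontology = ["privileged", "license"]

print(find_ontology_additions(index_terms, dictionary, ontology))
```

Because the candidates come from the data's own index, the ontology only grows with variants that actually occur in the collection.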
  • 80. Variant Generation Help the user search for what he meant Take names, numbers, and other entities for which the user wants to search Automatically generate likely synonyms
  • 81. Variant Generation Show the context of these variations, so the user can evaluate them. 6/4/2012 81
  • 82. Document Segmentation Examples of signatures: (1) Jean-Louis Koenig, President GGDA Region, MegaCorp International SA, Rue de Concours 2280, Bern, Switzerland, Tél. +41 (31) 125 2366; (2) Robert Guilliam, Product Regulatory Affairs & Compliance, MegaCorp International, Neuchatel Switzerland; (3) Alberto Goreman, Manager Printing & Packaging, Eastern Region, +57 3 451 7195, alberto_goreman@megacorp.com
  • 83. Finding words in context Phrase (total instances): risks alienating some (37); at serious risk of losing (25); are certain risks inherent in (16); are at risk of running (15); it be risking anything by (15); difference a risk o why (14); and the risks inherent in (12); without assuming any risk (8); we could risk losing next (7); avoid transferring risk to the (5); requires taking risks and the (4); can t risk not living (3); and unknown risks and uncertainties (2); a potential risk that was (2); avoid transfering risk to the (2) This increases coverage AND precision
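A phrase-frequency table like the one on this slide can be approximated by counting the word window around every token that starts with a chosen stem. The sketch below is one simple way to do that; the window sizes, the function name, and the sample sentences are assumptions for illustration, not the tool's actual method.

```python
import re
from collections import Counter


def context_counts(documents, stem, before=2, after=2):
    """Count the word windows surrounding each token that starts with `stem`."""
    counts = Counter()
    for doc in documents:
        words = re.findall(r"[a-z']+", doc.lower())
        for i, word in enumerate(words):
            if word.startswith(stem):
                window = words[max(0, i - before): i + after + 1]
                counts[" ".join(window)] += 1
    return counts


docs = [
    "We are at serious risk of losing the account.",
    "The team is at serious risk of losing funding.",
    "There are certain risks inherent in this plan.",
]

for phrase, total in context_counts(docs, "risk").most_common():
    print(f"{phrase}: {total}")
```

Seeing the stem in context lets a reviewer decide which phrasings belong in a concept and which are noise, before any of them are added as search terms.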
  • 84. Multi-Lingual Issues Does language matter? – Lucerne – Luzerne – Lucerna These places were all the same city Name of city not necessarily expressed in the same language as rest of document In Europe, many email threads and documents are mixed language, and must be properly categorized as such 6/4/2012 84
  • 85. Automated Ontology Expansion Tools Currently implemented expansion modules: Spelling variants: color>>colour, defense>>defence, labeled>>labelled Lemmatization (recovering uninflected form): walking>>walk, ate>>eat Morphological variants: eat>>eats, eating, eaten, ate hablar>>hablo, hablas, habla, hablan, habláis, hablamos Number expansion: $2.5B>>two point five billion dollars 2,567>>two thousand five hundred sixty seven 13>>13th, thirteenth Name variants: Elizabeth Van der Beek>>“Liz Van der Beek”, “Liz Vander Beek”, “Van der Beek, Elizabeth”, “Beth Vanderbeek”, etc. Email variants (mined from alias clusters file): Elizabeth Van der Beek>>evanderbeek, liz.vanderbeek, vanderbeekl, emvanderbeek, etc. Abbreviations: administrative project meeting>>admin project meeting, admin project mtg, admin proj mtg, etc. 6/4/2012 85

Editor's Notes

  1. Investors sued to recover losses from the liquidation of two hedge funds. 2003 – they retained counsel to help file suit; counsel advised them to retain documents to file a complaint. 2004 – filed complaint, stayed until 2007. 2007 – depositions revealed gaps in plaintiffs’ production of docs. Monetary sanctions against 13 plaintiffs, and had to process and produce back-up tapes at their own expense.
  2. Pension Committee found gross negligence and willfulness in the failure to include board members or investment committee members, and gross negligence in the failure to collect data from former employees
  3. Pension Committee – monetary sanctions on 13 plaintiffs; some parties had to process and produce back-up tapes at their own expense