Demystifying Technology Assisted Review
Part 3: Deconstructing the Technology
Sonya L. Sigler
  
Agenda
  Review/Overview
  Underlying Search Technology
        dtSearch
        Lucene (open source)
        Others – MySQL, etc.
  Underlying Statistical Based Technology
        Rules Based Technology (Linguistic or Statistical)
        Bayesian Probabilistic Technologies
        Latent Semantic Indexing
  Q & A

Demystifying Technology Assisted Review
  
Review/Overview - Search & Review Spectrum
  [Diagram; axis labels: Cost Per Document, Organization Commitment]
  Stages: Linear Review, Accelerated Review, Automated Review
  Techniques, in slide order:
        Culling
        Iterative search
        Review
        Email Threading
        Near Duplicate Detection
        CA - Clustering
        Relevance Ranking
        Categorization (Supervised)
        Machine Learning
        Latent Semantic Indexing (statistical probability)
        Pattern Analysis
        Sampling Data for High Precision and Recall Rates
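The slide names precision and recall without defining them. As a minimal sketch (the function name and the sample document IDs are hypothetical, not from the deck): precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical sample: 4 documents retrieved, 3 documents actually relevant.
SAMPLE_RETRIEVED = [1, 2, 3, 4]
SAMPLE_RELEVANT = [2, 3, 5]
```

In a review setting the "relevant" set would typically come from a human-coded sample rather than from the full population.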
  
Underlying Technologies
  Rules Based Systems
  Linguistic – word based
        Key word Search
              dtSearch
              Lucene
              Other Search Engines
        Ontologies
  Statistical - #s based
        Bayesian Classification
        Support Vector Models
        Latent Semantic Indexing
  
Database Normalization

  As received:
        From: Nuala Coogan Nuala@SFLData.com
        Subject: EDI Summit – Florida
        Date: October 3, 2012 10:11:21 AM PDT
        To: Sigler L. Sonya Sonya@sigler.name

  Normalized:
        From: Nuala Coogan
        Subject: EDI Summit – Florida
        Date: 10/03/12
        To: Sonya Sigler
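The header pair above can be reproduced with a short sketch. This is an illustration only: the function names and the alias table are hypothetical, and the code assumes the date always ends in a time-zone token, as it does in the slide's example.

```python
import re
from datetime import datetime

# Hypothetical alias table: maps a raw display name to one canonical form.
NAME_ALIASES = {"Sigler L. Sonya": "Sonya Sigler"}


def normalize_party(raw):
    """Drop the email address, then map the display name to its canonical form."""
    name = re.sub(r"\s*\S+@\S+", "", raw).strip()
    return NAME_ALIASES.get(name, name)


def normalize_date(raw):
    """Reduce 'October 3, 2012 10:11:21 AM PDT' to 'MM/DD/YY'."""
    without_zone = raw.rsplit(" ", 1)[0]  # assumes a trailing time-zone token
    parsed = datetime.strptime(without_zone, "%B %d, %Y %I:%M:%S %p")
    return parsed.strftime("%m/%d/%y")


RAW_HEADER = {
    "From": "Nuala Coogan Nuala@SFLData.com",
    "Date": "October 3, 2012 10:11:21 AM PDT",
    "To": "Sigler L. Sonya Sonya@sigler.name",
}

NORMALIZED_HEADER = {
    "From": normalize_party(RAW_HEADER["From"]),
    "Date": normalize_date(RAW_HEADER["Date"]),
    "To": normalize_party(RAW_HEADER["To"]),
}
```

A lookup table is used for the name because reordering "Sigler L. Sonya" into "Sonya Sigler" cannot be done reliably by pattern alone.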
  
Tokenization
  Words, Phrases, Symbols
        Mostly at the word level
        Numbers
        Punctuation
  Meaningful Elements or Pieces –> Tokens
  Parsing and Text Mining
  Treatment of Contractions, Hyphenated words, Emoticons and Larger Constructs (like URLs)
  Look-up tables
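To make the treatment of contractions, hyphenated words, emoticons and URLs concrete, here is a minimal tokenizer sketch. The pattern and the choices it makes (keep "Don't" and "e-mail" whole, keep a URL as one token) are assumptions for illustration; real engines make these choices differently, which is why the defaults matter.

```python
import re

# One alternative per kind of token; earlier alternatives win.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+            # a URL kept as a single token
  | [:;]-?[)(DP]            # a few simple emoticons
  | \w+(?:['’-]\w+)*        # words, numbers, contractions, hyphenated words
  | [^\w\s]                 # any other punctuation mark on its own
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens according to TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)


SAMPLE_TOKENS = tokenize("Don't e-mail me at http://example.com :-)")
```

Changing a single alternative, for example splitting on hyphens, changes which searches will hit a document.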
  
Linguistic Based Technologies

  Keyword Sample

    Simple:
        "legal systems" OR legalsystems
        "Mike Custodian"

    Medium:
        mail(custodian@domain.com) AND "legal systems"
        (Custodian w/3 (Mike OR Michael OR M))

    Complex:
        (privilege OR privileged OR legally OR "work product") NOT w/35
        (((original OR intended OR designated OR named) w/3
        (recipient OR recipients OR addressee OR addressees OR solely))
        OR ("message in error") OR ("received in error") OR ("named above")
        OR ((electronic or email or e-mail) w/3 (message or transmission))
        OR ("confidentiality notice"))

  Ontology Sample

        q ((+(std:%CapacityReports_% std:%DINCapacity_%) (std:%ACMEEPPlant_% std:%ProductName_%))
        (+(std:%ACMEPNPlant_% std:%ProductName_%) +(std:%ProductiveCapability_% std:%CapacityReports_%))
        (+(std:%CapacityCreep_% std:%OperationsImprovement_% std:%CapacityExpansion_% std:%CapacityRestoration_%) +(std:%ACMEPNPlant_% std:%ProductName_%))
        (+(std:%EquipmentReplacement_% std:%FinishingColumn_%) +(std:%ACMEPNPlant_% std:%ProductName_%))
        (std:%Audit_% actor:%Audit_%)
        (+(std:%SettlementNegotiations_% std:%ContractNegotiations_% ) +(actor:%ACMEOutsideCounsel_% std:%ACMEOutsideCounsel_% actor:%ACMEUBOutsideCounsel_% std:%AcmeSubOutsideCounsel_% actor:%AcmeSub_% std:%AcmeSub_%))
        (std:%FTC_% actor:%FTC_%)
        ((+subject:%ProductName_% +(std:swap std:"supply agreement" std:"exchange agreement" std:"agree to exchange")) std:"name
        [query truncated on slide]
  
Search Engines
  dtSearch
        dtSearch Corp., founded 1991
        Incorporated into Symantec’s Norton Navigator
        SDKs available, most license off the shelf
        http://support.dtsearch.com/faq/search.html
  Lucene
        Open source - http://lucene.apache.org/core/
        Doug Cutting, 1999, Part of Apache projects in 2001
        APIs, Customizable
  Other – MySQL, SQL (DBMS, RDBMS)
  
dtSearch
  Relativity, Concordance, Viewpoint, others
  Single User desktop license $199
  Little Customization – more similarities across apps
  Includes Boolean operators
  Includes Proximity searching
  Includes Fuzzy Searching
        Alphabet -> Alphaqet, alpphabet, alpkaqet
  
Lucene
  Clearwell, Intella, Cataphora, SHIFT, others
  Open Source Tool – meant to be customized
  Little Similarities Across Apps
        Know your defaults!
  Includes Boolean Operators
  Includes Proximity Searching
  
dtSearch – Fuzzy Searching
  Degrees of Fuzziness
        1-10; dtSearch uses 1-3
        Marked by use of % symbol
        Insertion: co%t → coat
        Deletion: coat → co%t
        Substitution: coat → cost
        Transposition: cots → cost
  Fuzziness Degrees
        Alphabet – Alphaqet, Alpkaqet
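The four edit types on this slide are the operations counted by a Damerau-Levenshtein style edit distance. The sketch below uses the "optimal string alignment" variant to show how a degree of fuzziness can be read as a maximum number of edits. It is an illustration only and is not claimed to be dtSearch's own algorithm; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Count insertions, deletions, substitutions and adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of a
    for j in range(cols):
        dist[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "cots" -> "cost".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


def fuzzy_match(term, candidates, max_edits):
    """Return the candidates within max_edits of term, ignoring case."""
    return [c for c in candidates if edit_distance(term.lower(), c.lower()) <= max_edits]
```

With this reading, "Alphaqet" is one edit away from "alphabet" and "Alpkaqet" is two, which matches the idea that a higher degree of fuzziness admits more distant misspellings.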
  
Boolean Operators – AND, OR, NOT
  dtSearch
        Search for multiple words treated as a phrase
        ANY – treats word list as separated by OR
        ALL – treats word list as separated by AND
  Lucene
        Depends on customization
              OR
              AND
        Know your defaults
        Spell out variations
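Why the default operator matters can be shown with a toy inverted index. Everything here (the three sample documents, the function names, the `default_operator` parameter) is hypothetical and only sketches the idea; it does not reproduce either engine's query parser.

```python
DOCS = {
    1: "white house press release",
    2: "house painting estimate",
    3: "white paper on search",
}


def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, words, default_operator="OR"):
    """Combine the posting sets of the words with the given default operator."""
    postings = [index.get(word.lower(), set()) for word in words]
    if not postings:
        return set()
    if default_operator == "AND":
        return set.intersection(*postings)
    return set.union(*postings)


INDEX = build_index(DOCS)
```

The same two-word query returns one document under AND and all three under OR, so an unstated default can silently widen or narrow a result set.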
  
Proximity
  dtSearch
        Pre/post
        w/ order doesn’t matter
              House white
              White house
        Pre/ finds first word prior to second word
              White house
  Lucene
        w/ order doesn’t matter
        No pre usage
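The difference between an unordered "within" search and an ordered "pre" search can be sketched as follows. The function name and its `ordered` flag are hypothetical; the code only illustrates the two behaviours described on the slide.

```python
def within(text, first, second, distance, ordered=False):
    """True if `first` and `second` occur within `distance` words of each other.

    With ordered=True, `first` must come before `second` (a "pre" style search).
    """
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    for i in first_positions:
        for j in second_positions:
            gap = j - i
            if gap == 0:
                continue  # same word position, not a pair
            if ordered and 0 < gap <= distance:
                return True
            if not ordered and abs(gap) <= distance:
                return True
    return False
```

An unordered search for white within 1 of house hits both "white house" and "house white"; the ordered form hits only "white house".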
  
Punctuation
  dtSearch
        Letters
        Space
        Ignored
        Hyphens
        % - fuzzy searching
        _ - ignored
  Lucene
        All punctuation treated as a word break
  
dtSearch Hyphen Example
  
Noise Words – Unindexed, Ignored
  dtSearch
        Unindexed, Can create Custom Index
        Many, but a few examples: Do, not, for, your, only, under, made, way
        Know defaults
  Lucene
        Ignores * in quotes
        (Quality Control*) = Quality Control but nothing else
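The effect of an unindexed noise-word list can be sketched in a few lines. The list below contains only the examples named on the slide, not any engine's full default list, and the function name is hypothetical.

```python
# Example noise words taken from the slide; real default lists are much longer.
NOISE_WORDS = {"do", "not", "for", "your", "only", "under", "made", "way"}


def index_terms(text, noise_words=NOISE_WORDS):
    """Return the words that would be indexed once noise words are dropped."""
    return [word for word in text.lower().split() if word not in noise_words]


INDEXED = index_terms("Do not call under your name")
```

Only "call" and "name" survive, so a search that depends on "do not" cannot work unless a custom index is built without those noise words.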
  
Stemming v. Wild Cards
  Stemming
        Syntactic Variations
              Regular Verbs
              Irregular Verbs
        dtSearch performs poorly with irregular verbs
  Wild Cards
        Strings of characters
        Replacements for beginning, parts, or endings
        Lucene - *
        dtSearch - ? for single character, * for any # of characters
        Time consuming
        Spelling out recommended
        Wild cards in quotes
  
Stemming v. Wild Cards Example
  Stemming (Catch – Lucene; Catch~ - dtSearch)
        Catch
        Catches
        Catching
        Catcher
        Caught - not in dtSearch
  Wild Cards (Catch*)
        Catch
        Catches
        Catching
        Catcher
        Catch1234 – not in stemming
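The two columns above can be reproduced with a toy comparison. The suffix-stripping stemmer here is a deliberately naive stand-in (an assumption, not the stemmer either product uses); the wildcard side uses the standard library's `fnmatch` pattern matching.

```python
from fnmatch import fnmatchcase

SUFFIXES = ("ing", "es", "er", "s")  # toy suffix list, for illustration only


def naive_stem(word):
    """Strip one common suffix; irregular forms such as 'caught' are untouched."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_matches(term, words):
    """Words whose stem equals the stem of the search term."""
    target = naive_stem(term)
    return [w for w in words if naive_stem(w) == target]


def wildcard_matches(pattern, words):
    """Words matching a * / ? wildcard pattern, ignoring case."""
    return [w for w in words if fnmatchcase(w.lower(), pattern.lower())]


WORDS = ["Catch", "Catches", "Catching", "Catcher", "Caught", "Catch1234"]
```

Rule-based stemming misses the irregular "Caught", while the wildcard also pulls in "Catch1234", which is not a form of the verb.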
  
Statistical Technologies
  Rules Based
  Bayesian Classification
  Vector Space Modeling
        Latent Semantic Indexing
  
Statistical Based Technologies
  Concept - Clustering
        Machine
        Unsupervised
        Quickly understand data
        Uncontrolled Clusters
  
Statistical Based Technologies
  Concept - Categorization
        User Created
        Supervised
        Control Topics
        Time Consuming
  
Statistical Based Technologies
Rules Based Systems
  If.. Then…
        If email = person 1 to person 2 then return it
        If email = person 1 or person 2 then return it
  Artificial Intelligence Systems
        Entity extraction (& dictionaries)
        Time consuming
        Mirror human thinking
              Case, subject matter
  Transparent System
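The two If/Then rules on this slide can be written down directly, which is also what makes a rules based system transparent: each hit can be traced to the rule that produced it. The `Email` class, the rule function names and the sample messages are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Email:
    sender: str
    recipients: list


def rule_person1_to_person2(email, person1, person2):
    """If email = person 1 to person 2 then return it."""
    return email.sender == person1 and person2 in email.recipients


def rule_person1_or_person2(email, person1, person2):
    """If email = person 1 or person 2 (as sender or recipient) then return it."""
    parties = {email.sender, *email.recipients}
    return person1 in parties or person2 in parties


def apply_rule(rule, emails, person1, person2):
    """Return the emails for which the rule fires."""
    return [e for e in emails if rule(e, person1, person2)]


# Hypothetical sample population.
EMAILS = [
    Email("ann", ["bob"]),
    Email("bob", ["carol"]),
    Email("carol", ["dave"]),
]
```

The second rule is broader than the first: it returns every message the first rule returns, plus any message that involves either person at all.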
  
Statistical Based Technologies
  Bayesian Classification
        Probabilistic
        Co-occurrence
        Frequency
  Spam Filters
        Viagra
        Concepts
        Words, phrases
  
Bayesian
  Bayesian illustration
        Baseball, glove, diamond, bats, hit, home run
        Diamond, pendant, jewelry
  Co-occurrence
        Local – within a document
        Global – across document population
  Frequency – how often does it appear
        Weighting – uniqueness counts
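The baseball/jewelry illustration can be turned into a tiny naive Bayes classifier. This is a sketch under stated assumptions: the two training "documents" are just the word lists from the slide, add-one (Laplace) smoothing is my choice, and no claim is made that any review product works exactly this way.

```python
import math
from collections import Counter

# Training text built from the word lists on the slide.
TRAINING = {
    "baseball": ["baseball glove diamond bats hit home run"],
    "jewelry": ["diamond pendant jewelry"],
}


def train(training):
    """Count words per category and collect the overall vocabulary."""
    word_counts = {
        label: Counter(word for doc in docs for word in doc.split())
        for label, docs in training.items()
    }
    doc_counts = {label: len(docs) for label, docs in training.items()}
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)  # prior
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out a category.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


MODEL = train(TRAINING)
```

"Diamond" appears in both categories, so it is the co-occurring words (glove, bats versus pendant) that decide which sense of the word a document is using.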
  
Statistical Based Technologies
  Vector Space Modeling
  Latent Semantic Indexing/Analysis
        Words
        Phrases, Concepts
        Tables
        Algebraic equations representing docs
        Weighting Algorithms
  
Latent Semantic Indexing Example
  Exclude Noise Words
        The, and, or, etc.
  Vector Space Modeling
        Build Document Profile
  Diamond
        Base, ball
        Necklace, pendant
        Diamond Saw
  
Mathematical Foundation
  Tables built with 0s, 1s
  Yes it has that word or phrase
  No it doesn’t
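A table of 0s and 1s like the one described here is a binary term-document matrix. The sketch below builds one for three made-up "diamond" documents and compares rows with cosine similarity, the usual measure in vector space modeling. The documents, names and noise-word list are hypothetical, and the sketch stops at the plain vector space model; it does not perform the dimensionality reduction that Latent Semantic Indexing adds.

```python
import math

NOISE_WORDS = {"the", "and", "or", "a"}

# Hypothetical documents using "diamond" in three different senses.
DOCS = {
    "doc_baseball": "the baseball diamond and the home run",
    "doc_jewelry": "a diamond necklace or a diamond pendant",
    "doc_tools": "the diamond saw",
}


def build_matrix(docs):
    """Return (vocabulary, rows) where each row holds 1 if the doc has the term."""
    vocabulary = sorted(
        {word for text in docs.values() for word in text.split() if word not in NOISE_WORDS}
    )
    rows = {
        name: [1 if term in text.split() else 0 for term in vocabulary]
        for name, text in docs.items()
    }
    return vocabulary, rows


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


VOCABULARY, MATRIX = build_matrix(DOCS)
```

Each document becomes a row of 0s and 1s over the shared vocabulary, and documents that share more terms score closer to 1.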
  
Simple Matrix with Weighting
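The matrix image for this slide did not survive extraction. As a hedged illustration of what weighting a term-document matrix can look like, the sketch below uses TF-IDF, a common scheme in which a term counts for more when it is frequent in one document and rare across the population. TF-IDF is my choice of example, not necessarily the scheme shown on the slide, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc name: {term: weight}} using raw count * log(N / doc frequency)."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for words in tokenized.values() for term in set(words))
    weights = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        weights[name] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical sample population.
SAMPLE_DOCS = {
    "a": "diamond pendant",
    "b": "diamond saw",
    "c": "diamond glove glove",
}
WEIGHTS = tf_idf(SAMPLE_DOCS)
```

Because "diamond" appears in every document its weight is zero, while the terms unique to one document carry all the weight.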
  
Weighted by Document (not just type)
  
Defensibility Report
  Document, Document, Document
  Transparency
  Workflow
  What Was Considered, By Whom?
  QC Process
  Metrics
  
Q&A - Thank you!

  Post your questions to the presenter in the chat section

  Sonya L. Sigler
  Vice President, Product Strategy & Consulting
  SFL Data
  415-321-8385
  sonya@sfldata.com
  www.sfldata.com
  

 

2012 11 7 TAR Webinar Part 3 Sigler

  • 1. Demys&fying  Technology   Assisted  Review   Part  3:  Deconstruc&ng  the   Technology   Sonya  L.  Sigler  
  • 2. Agenda     Review/Overview     Underlying  Search  Technology     dtSearch     Lucene  (open  source)     Others  –  My  SQL,  etc.     Underlying  StaCsCcal  Based  Technology     Rules  Based  Technology  (LinguisCc  or  StaCsCcal)     Bayesian  ProbabilisCc  Technologies     Latent  SemanCc  Indexing     Q  &  A                        Demys&fying  Technology  Assisted  Review  
  • 3. Review/Overview  -­‐  Search  &  Review  Spectrum   Linear  Review   Culling   IteraCve  search   Review   Accelerated  Review     Email  Threading   Near  Duplicate  DetecCon   Automated  Review     Per     CA  -­‐  Clustering   Relevance  Ranking   Document   CategorizaCon  (Supervised)   Machine  Learning   Cost   Latent  SemanCc  Indexing   (staCsCcal  probability)   PaRern  Analysis   Sampling  Data  for  High   Precision  and  Recall  Rates   Organiza3on  Commitment                        Demys&fying  Technology  Assisted  Review  
  • 4. Underlying  Technologies   Rules  Based  Systems   dtSearch   Key  word  Search   Ontologies   Lucene   Other  Search  Engines   LinguisCc  –  word  based   StaCsCcal  -­‐  #s  based   Bayesian  ClassificaCon   Support  Vector  Models   Latent  SemanCc  Indexing                        Demys&fying  Technology  Assisted  Review  
  • 5. Database  NormalizaCon   From:  Nuala  Coogan  Nuala@SFLData.com   Subject:  EDI  Summit  –  Florida   Date:  October  3,  2012  10:11:21  AM  PDT   To:  Sigler  L.  Sonya  Sonya@sigler.name     From:  Nuala  Coogan   Subject:  EDI  Summit  –  Florida   Date:  10/03/12   To:  Sonya  Sigler                        Demys&fying  Technology  Assisted  Review  
  • 6. TokenizaCon     Words,  Phrases,  Symbols     Mostly  at  the  word  level     Numbers     PunctuaCon     Meaningful  Elements  or  Pieces  –>  Tokens     Parsing  and  Text  Mining     Treatment  of  ContracCons,  Hyphenated  words,   EmoCcons  and  Larger  Constructs  (like  urls)     Look-­‐up  tables                        Demys&fying  Technology  Assisted  Review  
  • 7. LinguisCc  Based  Technologies   Keyword  Sample   Ontology  Sample    Simple:      q  ((+(std:%CapacityReports_%  std:%DINCapacity_ %)    (std:%ACMEEPPlant_%  std:%ProductName_%))    "legal  systems"  OR  legalsystems   (+(std:%ACMEPNPlant_%  std:%ProductName_%)  +  "Mike  Custodian”     (std:%ProducCveCapability_%  std: %CapacityReports_%))  (+(std:%CapacityCreep_%   std:%OperaConsImprovement_%  std:  Medium:   %CapacityExpansion_%  std:%CapacityRestoraCon_  mail(custodian@domain.com)  AND  "legal  systems”   %)  +(std:%ACMEPNPlant_%  std:%ProductName_ %))  (+(std:%EquipmentReplacement_%  std:  (Custodian  w/3  (Mike  OR  Michael  OR  M))   %FinishingColumn_%)  +(std:%ACMEPNPlant_%  std: %ProductName_%))  (std:%Audit_%  actor:%Audit_  Complex:   %)  (+(std:%SeRlementNegoCaCons_%  std: %ContractNegoCaCons_%  )  +(actor:  (privilege  OR  privileged  OR  legally  OR  "work   %ACMEOutsideCounsel_%  std: product")  NOT  w/35  (((original  OR  intended  OR   %ACMEOutsideCounsel_%  actor:%ACME   designated  OR  named)  w/3  (recipient  OR   UBOutsideCounsel_%  std: recipients  OR  addressee  OR  addressees  OR   %AcmeSubOutsideCounsel_%  actor:%AcmeSub_%   solely))  OR  ("message  in  error")  OR  ("received  in   std:%AcmeSub_%))  (std:%FTC_%  actor:%FTC_%)   ((+subject:%ProductName_%  +(std:swap   error")  OR  ("named  above")  OR  ((electronic  or   std:"supply  agreement"  std:"exchange  agreement"   email  or  e-­‐mail)  w/3  (message  or  transmission))   std:"agree  to  exchange"))  std:"name   OR  ("confidenCality  noCce"))                        Demys&fying  Technology  Assisted  Review  
  • 8. Search  Engines     dtSearch     dtSearch  Corp.,  founded  1991     Incorporated  into  Symantec’s  Norton  Navigator     SDKs  available,  most  license  off  the  shelf     hRp://support.dtsearch.com/faq/search.html     Lucene     Open  source  -­‐  hRp://lucene.apache.org/core/     Doug  Cukng,  1999,  Part  of  Apache  projects  in  2001     APIs,  Customizable     Other  –  My  SQL,  SQL,  (DBMS,  RDBMS)                        Demys&fying  Technology  Assisted  Review  
  • 9. dtSearch     RelaCvity,  Concordance,  Viewpoint,  others     Single  User  desktop  license  $199     LiRle  CustomizaCon  –  more  similariCes  across  apps     Includes  Boolean  operators     Includes  Proximity  searching     Includes  Fuzzy  Searching     Alphabet  -­‐>  Alphaqet,  alpphabet,  alpkaqet                        Demys&fying  Technology  Assisted  Review  
  • 10. Lucene     Clearwell,  Intella,  Cataphora,  SHIFT,  others     Open  Source  Tool  –  meant  to  be  customized     LiRle  SimilariCes  Across  Apps     Know  your  defaults!     Includes  Boolean  Operators     Includes  Proximity  Searching                        Demys&fying  Technology  Assisted  Review  
  • 11. dtSearch  –  Fuzzy  Searching     Degrees  of  Fuzziness     1-­‐10;  dtSearch  uses  1-­‐3     Marked  by  use  of  %  symbol     InserCon:  co%t  →  coat     DeleCon:  coat  →  co%t     SubsCtuCon:  coat  →  cost     TransposiCon  cots  →  cost     Fuzziness  Degrees     Alphabet  –  Alphaqet,  Alpkaqet                        Demys&fying  Technology  Assisted  Review  
  • 12. Boolean Operators - AND, OR, NOT
      dtSearch
        Search for
          Multiple words treated as a phrase
          ANY - treats word list as separated by OR
          ALL - treats word list as separated by AND
      Lucene
        Depends on customization
          OR
          AND
        Know your defaults
        Spell out variations
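To see why "know your defaults" matters, the sketch below runs the same two-word query under the three interpretations named on the slide: phrase, ANY (implicit OR), and ALL (implicit AND). It is an illustrative toy using whitespace tokenization, not either engine's query parser; the function name `matches` and the mode strings are assumptions.

```python
def matches(document, words, mode):
    """Check one document against a word list under a given default.

    mode: "phrase" (words must appear adjacent and in order),
          "any"    (implicit OR), or
          "all"    (implicit AND).
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    tokens = document.lower().split()
    words = [w.lower() for w in words]
    if mode == "any":
        return any(w in tokens for w in words)
    if mode == "all":
        return all(w in tokens for w in words)
    if mode == "phrase":
        n = len(words)
        return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))
    raise ValueError("unknown mode: " + mode)


if __name__ == "__main__":
    docs = [
        "our legal systems are outdated",
        "systems reviewed by the legal team",
        "new accounting systems",
    ]
    for mode in ("phrase", "all", "any"):
        hits = [d for d in docs if matches(d, ["legal", "systems"], mode)]
        print(mode, len(hits))
```

The same query returns a different number of documents under each default, which is the practical risk when the same search string is moved between tools.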
  • 13. Proximity
      dtSearch
        Pre/post
        w/ - order doesn't matter
          House white
          White house
        Pre/ - finds first word prior to second word
          White house
      Lucene
        w/ - order doesn't matter
        No pre usage
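The difference between an unordered "within N words" search and an ordered "pre" search can be sketched with token positions. This is a simplified illustration, not the dtSearch or Lucene implementation: the function names are hypothetical, tokens come from a plain whitespace split, and distance is counted as the difference in word positions.

```python
def positions(tokens, word):
    """Indexes at which `word` occurs in the token list."""
    return [i for i, t in enumerate(tokens) if t == word]


def within(text, first, second, distance, ordered=False):
    """True if `first` and `second` occur within `distance` words.

    ordered=False models a w/N style search (either order matches).
    ordered=True models a pre/N style search (first must precede second).
    """
    tokens = text.lower().split()
    for i in positions(tokens, first.lower()):
        for j in positions(tokens, second.lower()):
            gap = j - i
            if ordered and 0 < gap <= distance:
                return True
            if not ordered and 0 < abs(gap) <= distance:
                return True
    return False


if __name__ == "__main__":
    print(within("white house", "white", "house", 1))                # unordered
    print(within("house white", "white", "house", 1))                # unordered
    print(within("house white", "white", "house", 1, ordered=True))  # ordered
```

As on the slide, the unordered search accepts both "house white" and "white house", while the ordered variant accepts only "white house".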
  • 14. Punctuation
      dtSearch
        Letters
        Space
        Ignored
        Hyphens
        % - fuzzy searching
        _ - ignored
      Lucene
        All punctuation treated as a word break
  • 15. dtSearch Hyphen Example
  • 16. Noise Words - Unindexed, Ignored
      dtSearch
        Unindexed; can create custom index
        Many, but a few examples: do, not, for, your, only, under, made, way
        Know defaults
      Lucene
        Ignores * in quotes
        (Quality Control*) = Quality Control but nothing else
  • 17. Stemming v. Wild Cards
      Stemming
        Syntactic variations
        Regular verbs
        Irregular verbs
        dtSearch performs poorly with irregular verbs
        Spelling out recommended
      Wild Cards
        Strings of characters
        Replacements for beginning, parts, or endings
        Lucene - *
        dtSearch - ? for single character, * for any # of characters
        Time consuming
        Wild cards in quotes
  • 18. Stemming v. Wild Cards Example
      Stemming (Catch - Lucene; Catch~ - dtSearch)
        Catch
        Catches
        Catching
        Catcher
        Caught - not in dtSearch
      Wild Cards (Catch*)
        Catch
        Catches
        Catching
        Catcher
        Catch1234 - not in stemming
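The contrast in the example above can be reproduced with a toy suffix-stripping stemmer and a wildcard match. This is a rough sketch under stated assumptions: `naive_stem` and its suffix list are invented for illustration and are far cruder than the stemmers in dtSearch or Lucene, and the wildcard match uses Python's standard `fnmatch` module rather than either engine's syntax.

```python
from fnmatch import fnmatchcase

# Hypothetical suffix list for illustration; real stemmers use richer rules.
SUFFIXES = ("ing", "es", "er", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_search(term, vocabulary):
    """Words that reduce to the same stem as `term`."""
    target = naive_stem(term)
    return [w for w in vocabulary if naive_stem(w) == target]


def wildcard_search(pattern, vocabulary):
    """Words matching a pattern where * is any run and ? is one character."""
    return [w for w in vocabulary if fnmatchcase(w.lower(), pattern.lower())]


if __name__ == "__main__":
    vocab = ["catch", "catches", "catching", "catcher", "catch1234", "caught"]
    print(stem_search("catch", vocab))
    print(wildcard_search("catch*", vocab))
```

A rule-based stemmer like this one misses the irregular form "caught", and the wildcard picks up "catch1234", which no stemmer would treat as a variant of the verb. Neither approach finds everything, which is why the slides recommend spelling out variations.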
  • 19. Statistical Technologies
      Rules Based
      Bayesian Classification
      Vector Space Modeling
      Latent Semantic Indexing
  • 20. Statistical Based Technologies
      Concept - Clustering
        Machine
        Unsupervised
        Quickly understand data
        Uncontrolled clusters
  • 21. Statistical Based Technologies
      Concept - Categorization
        User created
        Supervised
        Control topics
        Time consuming
  • 22. Statistical Based Technologies
      Rules Based Systems
        If... Then...
          If email = person 1 to person 2 then return it
          If email = person 1 or person 2 then return it
        Artificial Intelligence Systems
        Entity extraction (& dictionaries)
        Time consuming
        Mirror human thinking
        Case, subject matter
        Transparent system
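The two "If... Then..." rules on the slide can be written as small predicate functions, which also illustrates why rules-based systems are considered transparent: each decision traces back to a readable rule. This is a hypothetical sketch. The `Email` record, the function names, and the reading of "person 1 or person 2" as "either person appears as sender or recipient" are assumptions, not something the slides specify.

```python
from dataclasses import dataclass


@dataclass
class Email:
    """Minimal stand-in for an email record (hypothetical fields)."""
    sender: str
    recipients: tuple
    subject: str = ""


def rule_between(email, person_1, person_2):
    """If the email is from person 1 to person 2, then return it."""
    return email.sender == person_1 and person_2 in email.recipients


def rule_involves(email, person_1, person_2):
    """If the email involves person 1 or person 2, then return it."""
    people = {email.sender, *email.recipients}
    return person_1 in people or person_2 in people


def apply_rule(rule, emails, person_1, person_2):
    """Return the emails for which the rule fires."""
    return [e for e in emails if rule(e, person_1, person_2)]


if __name__ == "__main__":
    emails = [
        Email("mike", ("sonya",), "capacity report"),
        Email("sonya", ("pat",), "audit"),
        Email("pat", ("lee",), "lunch"),
    ]
    print(len(apply_rule(rule_between, emails, "mike", "sonya")))
    print(len(apply_rule(rule_involves, emails, "mike", "sonya")))
```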
  • 23. Statistical Based Technologies
      Bayesian Classification
        Probabilistic
        Co-occurrence
        Frequency
      Spam Filters
        Viagra
      Concepts
        Words, phrases
  • 24. Bayesian
      Bayesian illustration
        Baseball, glove, diamond, bats, hit, home run
        Diamond, pendant, jewelry
      Co-occurrence
        Local - within a document
        Global - across document population
      Frequency - how often does it appear
      Weighting - uniqueness counts
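The baseball versus jewelry illustration maps naturally onto a naive Bayes classifier, the same family of technique used in many spam filters. The sketch below is a minimal multinomial naive Bayes with add-one smoothing, written from scratch for illustration. It is not the implementation of any review product; the class name, the labels, and the tiny training texts (taken from the slide's word lists) are assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, label, text):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_scores(self, text):
        """Log of prior times word likelihoods, per label."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train("baseball", "baseball glove diamond bats hit home run")
    nb.train("jewelry", "diamond pendant jewelry")
    print(nb.classify("diamond glove hit"))
    print(nb.classify("diamond pendant"))
```

The word "diamond" alone is ambiguous. It is the words that co-occur with it ("glove", "hit" versus "pendant") that tip the probability toward one concept, which is the slide's point about co-occurrence and frequency.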
  • 25. Statistical Based Technologies
      Vector Space Modeling
      Latent Semantic Indexing/Analysis
        Words
        Phrases, concepts
        Tables
        Algebraic equations representing docs
        Weighting algorithms
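In vector space modeling, each document becomes a vector of term counts, and documents are compared by the angle between their vectors. The sketch below shows that comparison with cosine similarity over raw counts. It covers only the vector space step; latent semantic indexing additionally applies a matrix factorization (singular value decomposition) that is not shown here. The function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a document as a sparse vector of raw term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    baseball = vectorize("baseball diamond glove bats")
    jewelry = vectorize("diamond pendant necklace jewelry")
    query = vectorize("diamond necklace")
    print(round(cosine(query, baseball), 3))
    print(round(cosine(query, jewelry), 3))
```

Both documents contain "diamond", but the jewelry document shares more of the query's terms and therefore scores as the closer match.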
  • 26. Latent Semantic Indexing Example
      Exclude noise words
        The, and, or, etc.
      Vector space modeling
      Build document profile
        Diamond
          Base, ball
          Necklace, pendant
          Diamond saw
  • 27. Mathematical Foundation
      Tables built with 0s, 1s
        Yes, it has that word or phrase
        No, it doesn't
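A table of 0s and 1s of this kind is usually called a binary term-document matrix: one row per term, one column per document, with 1 meaning "yes, the document has that word". The following is a minimal sketch of building one after dropping noise words. The noise word list, function name, and sample documents are assumptions for illustration.

```python
# Hypothetical noise word list; real engines ship much longer defaults.
NOISE_WORDS = {"the", "and", "or", "a", "of"}


def build_matrix(docs):
    """Return (terms, matrix) where matrix[r][c] is 1 if term r is in doc c."""
    tokenized = [
        [w for w in doc.lower().split() if w not in NOISE_WORDS]
        for doc in docs
    ]
    terms = sorted({w for doc in tokenized for w in doc})
    matrix = [[1 if term in doc else 0 for doc in tokenized] for term in terms]
    return terms, matrix


if __name__ == "__main__":
    docs = ["the diamond and the ball", "diamond pendant", "diamond saw"]
    terms, matrix = build_matrix(docs)
    for term, row in zip(terms, matrix):
        print(term.ljust(8), row)
```

Each column is the document's profile. The row for "diamond" is all 1s, so on its own that term does nothing to tell the three documents apart.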
  • 28. Simple Matrix with Weighting
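The slide's own matrix is not recoverable from this transcript, so the weighting it used is unknown. One common weighting scheme, shown here purely as an assumed example, is TF-IDF: a term's weight rises with how often it appears in a document and falls with how many documents contain it, which matches the earlier point that "uniqueness counts". The function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency).
    This is one standard TF-IDF variant, chosen for illustration.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    weights = []
    for doc in tokenized:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    docs = ["diamond ball", "diamond pendant", "diamond saw"]
    for doc, w in zip(docs, tfidf(docs)):
        print(doc, {t: round(v, 3) for t, v in w.items()})
```

With this variant, a term that appears in every document ("diamond") gets a weight of zero, while a term unique to one document ("pendant") gets the highest weight.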
  • 29. Weighted by Document (not just type)
  • 30. Defensibility Report
      Document, Document, Document
      Transparency
      Workflow
      What Was Considered, By Whom?
      QC Process
      Metrics
  • 31. Q&A - Thank you!
      Post your questions to the presenter in the chat section
      Sonya L. Sigler
      Vice President, Product Strategy & Consulting
      SFL Data
      415-321-8385
      sonya@sfldata.com
      www.sfldata.com