Cohasset Associates, Inc.

         Will Technology-Assisted Predictive Modeling and Auto-
                Classification End the ‘End-User’ Burden in
                          Records Management?

                             2012 Managing Electronic Records Conference
                                                              Chicago, IL
                                                             May 7, 2012

                                                               Jason R. Baron, Esq.
                                                                     Director of Litigation
                                                               Office of General Counsel
                                            National Archives and Records Administration

                                                                   Dave Lewis, Ph.D.
                                                           David D. Lewis Consulting, LLC
                                                                             Chicago, IL

         A New Era of Government
                “[P]roper records management is the backbone of open Government.”
                      President Obama’s Memorandum dated November 28, 2011
                                 re “Managing Government Records”


        The era of Big Data has just
          Lehman Brothers Investigation
             -- 350 billion page universe (3 petabytes)
             -- Examiner narrowed collection by selecting
          key custodians, using dozens of Boolean
             -- Reviewed 5 million docs (40 million pages)
          using 70 contract attorneys
          Source: Report of Anton R. Valukas, Examiner, In re Lehman Brothers Holdings Inc., et al., Chapter 11
          Case No. 08-13555 (U.S. Bankruptcy Ct. S.D.N.Y. March 11, 2010), Vol. 7, Appx. 5, at

        Process Optimization Problem 1: The
        transactional toll of user-based
        recordkeeping schemes (“as is” RM)


        …. and the need for
        better, automated solutions ….



        Impact of Technology on E-Records
        Management: Snapshot 2012 (“As is”)
           A universe of proprietary products exists in the
            marketplace: document management and
            records management applications (RMAs)
           DoD 5015.2 version 3 compliant products
           However, scalability issues exist
           Agencies must prepare to confront significant
            front-end process issues when transitioning to
            electronic recordkeeping
           Records schedule simplification is key


        RM wish list for 2012….
           RM’s “easy button”: the elusive goal of zero
            extra keystrokes to comply with RM
            requirements (capture)
           A technology app that automatically tags
            records in compliance with RM policies and
            practices (categorize)
           Supervised learning RM with minimal records
            officer or end user involvement (learn)
           Rule-based and role-based RM
           Advanced search

        Electronic Archiving As The
        First Step
           What is it?
            100% snapshot of (typically) email, plus in some
            cases other selected ESI applications
           How does it differ from an RMA?
            Goal is preservation of evidence, not records
            management per se
           NARA Bulletin 2008-05



        A Possible Path Forward?
           Email archiving in short term, synced to existing
            proprietary software on email system
           Designation of key senior officials as creating
            permanent records, consistent with existing records
           Additional designations of permanent records by
            agency component
           “Smart” filters/categorical rules built in based on
            content, to the extent feasible to do
           Default are records in designated temporary record
            buckets, disposed of under existing records
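
        The role-based defaults in this list can be sketched in a few lines.
        This is a minimal illustration only: the role names, the Email fields,
        and the content rule shown are hypothetical and are not taken from any
        agency schedule or product.

```python
from dataclasses import dataclass

# Hypothetical set of "key senior official" roles; a real list would come
# from agency policy and the approved records schedule.
PERMANENT_ROLES = {"agency head", "general counsel"}


@dataclass
class Email:
    sender_role: str
    subject: str = ""
    component_flagged_permanent: bool = False  # set by an agency component


def disposition(email, content_rules=()):
    """Return a retention bucket for one email.

    Order of checks mirrors the list above: senior-official role first,
    then component designation, then simple content rules, then the
    temporary default.
    """
    if email.sender_role.lower() in PERMANENT_ROLES:
        return "permanent"
    if email.component_flagged_permanent:
        return "permanent"
    for keyword, bucket in content_rules:
        if keyword in email.subject.lower():
            return bucket
    return "temporary"


# Hypothetical content rule: (keyword in subject, bucket name).
RULES = [("final rule", "permanent")]

print(disposition(Email("Agency Head", "Lunch?")))                  # permanent
print(disposition(Email("Analyst", "Draft of final rule"), RULES))  # permanent
print(disposition(Email("Analyst", "Parking notice"), RULES))       # temporary
```

        The point of the ordering is that no end-user keystrokes are needed:
        every message falls into some bucket by default.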

         A pyramid approach combines disposition policy with automated
         tools to bring FRA email under records
         management, preservation, and access
         [Figure: pyramid diagram. Legend: "= permanent or top"; "= temporary or staff and support"]


         The position of the “set-point” for email capture depends on policy and resources:
         setting it higher allows use of tools now available to get 100% of email at lower
         volumes;* setting it lower means more records will be captured and smarter tools
         are needed to distinguish and disposition temporary- and non-record.

         Implementing an email archiving policy is feasible now, since tools are readily
         available to capture 100% of email traffic at the individual or organizational level, in
         formats that can be archived.


        How To Avoid A Train Wreck
        With Email Archiving….

                     Capture E-mail But Utilize Records Management!

        Functional Requirements for
        Categorization Products in the Federal

           Ease of use …. Scalability …. Archiving in native
           formats….. Metadata preservation … Seamless integration
           with existing software apps …. Versioning …. Compatibility
           with big bucket records schedules …. Advanced search
           capabilities …. Ease of training / machine learning using
           records officers or end users …. Cost

        Process Optimization Problem 2: The
        Coming Age of Dark Archives (and the
        inability to provide access)



                 Emerging New Strategies:
                  “Predictive Analytics”

        Improved review and case
        assessment: cluster docs
        thru use of software with
        minimal human
        intervention at front end to
        code “seeded” data set

        Slide adapted from Gartner Conference,
        June 23, 2010, Washington, D.C.

         [Figure: language processing tasks: Retrieval / Search; Classification;
          Question Answering; Entity Recognition; Information Extraction;
          Machine Translation. Labels: "Information ..."; "Natural Language Processing"]

         Text Classification
            Deciding which of
             several groups a text
             belongs to
            Crudest form of
                ...but often can be automated
                 with high accuracy
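
         As an illustration of deciding which group a text belongs to, here is
         a minimal sketch of a naive Bayes text classifier using only the
         Python standard library. Naive Bayes is one common technique chosen
         for the example, not necessarily what any given product uses; the
         training texts and the "permanent" / "temporary" labels are
         hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(labeled_docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def class_probabilities(model, text):
    """Return {label: probability} using add-one (Laplace) smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        log_scores[label] = score
    # Convert log scores to probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}


def classify(model, text):
    probs = class_probabilities(model, text)
    return max(probs, key=probs.get)


# Hypothetical hand-labeled examples (the "seed" set).
TRAINING = [
    ("policy memo signed by the director", "permanent"),
    ("final agency policy decision", "permanent"),
    ("lunch order for friday", "temporary"),
    ("parking lot closed friday", "temporary"),
]
MODEL = train(TRAINING)

print(classify(MODEL, "director policy decision"))
print(classify(MODEL, "friday lunch"))
```

         The same model also yields a probability per class, which is what a
         "probabilistic" classification output refers to.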



                       Why Classify?
         Reduce infinite variety of text...
         ... set of ...
         ... an action for every possible input.


        Other Advantages of Text Classification
           Supervised learning:
               Classifiers (rules) can be
                learned by imitating manual

           Straightforward numerical
            measures of quality
                e.g., recall: 85% +/- 4%
                      precision: 75% +/- 3%

           Objective reason why a
            decision was made                          classification
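
        Figures such as "recall: 85% +/- 4%" come from comparing classifier
        output with human judgments on a sample. A minimal sketch of that
        computation follows; the sample judgments are made up, and the
        margin uses the simple normal approximation for a proportion at
        roughly 95% confidence, which is one assumption among several
        possible interval methods.

```python
import math


def precision_recall(predicted, actual):
    """predicted/actual are parallel lists of booleans (True = in the class)."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


def margin_of_error(proportion, sample_size, z=1.96):
    """Normal-approximation margin for a proportion (z=1.96 is about 95%)."""
    if sample_size == 0:
        return 0.0
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# Hypothetical sample of eight documents.
PREDICTED = [True, True, True, False, False, True, False, True]
ACTUAL = [True, True, False, True, False, True, False, False]

precision, recall = precision_recall(PREDICTED, ACTUAL)
print(f"precision: {precision:.2f}")  # 3 of 5 predicted positives are correct
print(f"recall: {recall:.2f}")        # 3 of 4 actual positives were found
```

        Precision asks how much of what the classifier tagged was right;
        recall asks how much of what should have been tagged was found.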


        Variations on Classification
           Binary vs. multiclass

           Hierarchical

           Probabilistic      83%            17%

           Graded / ordered / fuzzy


        Defining Sets of Classes
           Tradeoff among
               Ideal classes to
               Classes you can teach

                people to assign
                Classes you can
                teach software to assign
           Be skeptical of automatic
            discovery of classes

        Text Retrieval Systems
           AKA search engines,
            databases, text
            databases, etc.


                Classification          Search

                autonomous              interactive
                long term               transitory
                organizational          personal
                structured              independent   ? ?



        Some Distinctions Among
        Search Approaches
           Exact Match vs.
            Ranked Retrieval vs.
            Browsing
           Text Representations

           Matching Aids


        Exact Match Search
           Query specifies conditions
            document must meet
                e.g., budget AND Knoxville
                AND (revised OR preliminary)
           Variants
               Boolean
               SQL
               Faceted
           Often (ambiguously) called
            "keyword" search

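        The example query on this slide, budget AND Knoxville AND (revised OR
        preliminary), can be evaluated with set operations over an inverted
        index. The sketch below assumes a tiny made-up document collection;
        the helper names are illustrative, not from any search product.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index


# Hypothetical documents.
DOCS = {
    1: "Revised budget for the Knoxville office",
    2: "Preliminary Knoxville budget estimate",
    3: "Knoxville budget, final version",
    4: "Revised budget for the Memphis office",
}
INDEX = build_index(DOCS)


def term(word):
    """Documents containing the word (empty set if the word never occurs)."""
    return INDEX.get(word.lower(), set())


# AND is set intersection (&); OR is set union (|).
hits = term("budget") & term("Knoxville") & (term("revised") | term("preliminary"))
print(sorted(hits))  # documents 1 and 2 meet every condition
```

        A document either meets all the conditions or it does not; there is
        no notion of a better or worse match, which is the contrast with
        ranked retrieval.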

        A Faceted Search Interface



        Ranked Retrieval
           Query specifies important
            attributes of desired
           System statistically weights
            those attributes
           Results returned in order of
            strength of match
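
        One common way to "statistically weight" query attributes is TF-IDF:
        a term counts for more when it is frequent in a document and rare in
        the collection. The sketch below is a simplified illustration of
        that idea with made-up documents, not the scoring formula of any
        particular system.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def rank(query, docs):
    """Return (doc_id, score) pairs, best match first."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    results = []
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = sum(
            term_freq[term] * math.log(n_docs / doc_freq[term])
            for term in tokenize(query)
            if doc_freq[term]  # skip query terms absent from the collection
        )
        results.append((doc_id, score))
    # Highest score first; ties broken by document id for stable output.
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical documents.
DOCS = {
    "a": "budget budget knoxville revised",
    "b": "budget memo",
    "c": "holiday party schedule",
}

for doc_id, score in rank("knoxville budget", DOCS):
    print(doc_id, round(score, 3))
```

        Unlike exact match, every document gets a score, so near misses
        still appear, just lower in the list.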


        Statistical Evidence in Ranked Retrieval
           Corpus statistics
               Word (and metadata) counts
           Unsupervised learning
               Clustering, LSI/LSA etc.
               finds (maybe useless) patterns
           Supervised learning
               aka "relevance feedback"
               learn indicators of user interest
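
        Relevance feedback can be sketched with the classic Rocchio update:
        query term weights move toward documents the user marked relevant
        and away from those marked non-relevant. Rocchio is named here as
        one well-known example of the idea; the alpha, beta, and gamma
        values and the sample texts are illustrative assumptions.

```python
from collections import Counter


def rocchio(query_weights, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Return updated {term: weight}, dropping terms whose weight is <= 0."""
    updated = Counter({term: alpha * w for term, w in query_weights.items()})
    for docs, coeff in ((relevant_docs, beta), (nonrelevant_docs, -gamma)):
        if not docs:
            continue
        for text in docs:
            for term, count in Counter(text.lower().split()).items():
                # Average term count across the judged documents in this group.
                updated[term] += coeff * count / len(docs)
    return {term: w for term, w in updated.items() if w > 0}


# Hypothetical user judgments on two retrieved documents.
QUERY = {"budget": 1.0}
RELEVANT = ["knoxville budget revised"]
NONRELEVANT = ["budget party"]

new_query = rocchio(QUERY, RELEVANT, NONRELEVANT)
for term, weight in sorted(new_query.items()):
    print(term, round(weight, 2))
```

        After one round, the query has picked up "knoxville" and "revised"
        as indicators of interest without the user typing them.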


           Hierarchies
           Networks
           Clusters
           Spaces / Maps / Dimensions
               make great pictures / demos
               unclear if useful for finding information



        Visual Analysis Examples
        (Presentation by Dr. Victoria Lemieux, Univ. British Columbia,
        at Society of American Archivist Annual Mtg. 2010, Washington, D.C.)

                    With acknowledgments to Jeffrey Heer, Exploring Enron,
                             Adam Perer, Contrasting Portraits,
                             and Fernanda Viegas, Email Conversations



        What Evidence Can The
        Search Software Use?
           Words, phrases, etc.
           Manually assigned categories
           Metadata
               Author, organization, creation date, change
                date, access date, length, file type,...
           Contextual information (links,


        What Resources Aid
           Linguistic analysis
               At word level or higher
           Clusters / spaces / ...
           Thesauri / semantic nets /
            concept maps / ...
               Suited to your task?
               Modifiable?
               How is text determined to
                belong to category?

        Concepts v. Keywords
        Supreme Court of Information Retrieval, Case No. 1-tfidf-0-2902, 2009

           Search software marketing:
               Them = keyword search = bad
               Us = concept search = good
           Reality:
               Both terms have referred to dozens of
                different technologies...
               ...including some of the same ones!
           Conceptual search is an aspiration, not
            a technology


            Example of Boolean search string
            from U.S. v. Philip Morris

            (((master settlement agreement OR msa) AND NOT (medical
             savings account OR metropolitan standard area)) OR s. 1415
             OR (ets AND NOT educational testing service) OR (liggett
             AND NOT sharon a. liggett) OR atco OR lorillard OR (pmi
             AND NOT presidential management intern) OR pm usa OR
             rjr OR (b&w AND NOT photo*) OR phillip morris OR batco
             OR ftc test method OR star scientific OR vector group OR
             joe camel OR (marlboro AND NOT upper marlboro)) AND
             NOT (tobacco* OR cigarette* OR smoking OR tar OR
             nicotine OR smokeless OR synar amendment OR philip
             morris OR r.j. reynolds OR ("brown and williamson") OR
             ("brown & williamson") OR bat industries OR liggett group)
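
            Much of this string is "X AND NOT Y" clauses that try to separate
            a tobacco-related sense of a term from an unrelated one. A minimal
            sketch of how one such clause behaves at the document level is
            below; the helper functions and sample texts are hypothetical and
            do not reproduce the search tool used in the case.

```python
import re


def contains(text, phrase):
    """True if the phrase occurs in the text as whole words, ignoring case."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text.lower()) is not None


def and_not(text, wanted, excluded_phrases):
    """Evaluate (wanted AND NOT (excluded_1 OR excluded_2 ...)) for one document."""
    return contains(text, wanted) and not any(
        contains(text, phrase) for phrase in excluded_phrases
    )


# Clause from the string above: (marlboro AND NOT upper marlboro)
EXCLUDE = ["upper marlboro"]

print(and_not("Marlboro ad campaign", "marlboro", EXCLUDE))                 # True
print(and_not("Meeting in Upper Marlboro, MD", "marlboro", EXCLUDE))        # False
print(and_not("Marlboro shipment to Upper Marlboro", "marlboro", EXCLUDE))  # False
```

            The third case shows the cost of the approach: a document that
            mentions both the brand and the town is excluded along with the
            noise.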


         U.S. v. Philip Morris E-mail Winnowing

            Count         Description
            20 million    email records
            200,000       hits based on keyword terms used (1%)
            100,000       relevant emails
            80,000        produced to opposing party
            20,000        placed on privilege logs

             A PROBLEM: only a handful entered as exhibits at trial
             A BIGGER PROBLEM: the 1% figure does not scale
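
             The scaling concern is simple arithmetic. The sketch below
             applies the 1% hit rate from this case to larger collections;
             the one-billion-record collection is a hypothetical chosen
             only to show the trend.

```python
# Figures from the slide: 20 million emails, 200,000 keyword hits,
# 100,000 of those judged relevant.
EMAILS = 20_000_000
HITS = 200_000
RELEVANT = 100_000

HIT_RATE = HITS / EMAILS           # 0.01, the "1%" on the slide
RELEVANT_SHARE = RELEVANT / HITS   # 0.5, half of the hits were relevant


def keyword_hits(corpus_size, hit_rate=HIT_RATE):
    """Documents a human team would need to review at the same hit rate."""
    return round(corpus_size * hit_rate)


# The second size is hypothetical.
for size in (EMAILS, 1_000_000_000):
    print(f"{size:>13,} records -> {keyword_hits(size):>10,} to review")
```

             At the same 1% rate, a billion-record collection leaves ten
             million documents for manual review, and on these figures
             half of them would not be relevant.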


        Judicial endorsement of predictive analytics
        in document review by Judge Peck in Da
        Silva Moore v. Publicis Groupe (S.D.N.Y. Feb.
        24, 2012)
             This opinion appears to be the first in which a Court
             has approved of the use of computer-assisted review.
             . . . What the Bar should take away from this Opinion
             is that computer-assisted review is an available tool
             and should be seriously considered for use in large-
             data-volume cases where it may save the producing
             party (or both parties) significant amounts of legal
             fees in document review. Counsel no longer have to
             worry about being the ‘first’ or ‘guinea pig’ for judicial
             acceptance of computer-assisted review . . .
             Computer-assisted review can now be considered
             judicially-approved for use in appropriate cases.


              Social Networking/Links Analysis Example

                                            From Marc Smith
                                            Posted on Flickr
                                            Under Creative Commons License

        Judicial second guessing of failure to use
        e-search capabilities: Capitol Records v.
        MP3 Tunes, 261 F.R.D. 44 (S.D.N.Y. 2009)

           “In [a prior case] the Court notes its dismay that the
            party opposing discovery of its ESI had organized its
            files in a manner which seemed to serve no purpose
            other than ‘to discourage audits. . .’ Similarly, in this
            case, [the party] host[ed] no ediscovery software on
            their servers and apparently are unable to conduct
            centralized email searches of groups of users
            without downloading them to a separate file and
            relying on the services of an outside vendor.”

        Judicial second guessing of failure to use
        e-search capabilities: Capitol Records v.
        MP3 Tunes (cont’d)
        Court went on to add:
        “The day undoubtedly will come when
          burden arguments based on a large
          organization’s lack of internal ediscovery
          software will be received about as well as the
          contention that a party should be spared from
          retrieving paper documents because it had
          filed them sequentially, but in no apparent
          groupings, in an effort to avoid the added
          expense of file folders or indices.”


        Problem 3: Innovative


        The records management world of

        Background Law Review Referencing Autocategorization &
           Advanced Search
        J. Baron, “Law in the Age of Exabytes: Some Further Thoughts on
           ‘Information Inflation’ and Current Issues in E-Discovery
           Search,” 17 Richmond J. Law & Technology (2011)

        Latest “Predictive Coding” Case Law to follow in blogs online:
         Da Silva Moore v. Publicis Groupe & MSL Group, 11 Civ. 1279
           (S.D.N.Y.) (Peck, M.J.) (Opinion dated Feb. 24, 2012)
         Kleen Products, LLC v. Packaging Corp. of America, 10 C 5711
           (N.D. Ill.) (Nolan, M.J.)




           Jason R. Baron
         Director of Litigation
            Office of General Counsel
            National Archives and
            Records Administration

           (301) 837-1499


          Dave Lewis, Ph.D.
        David D. Lewis Consulting, LLC
          Chicago, IL



Wound healing PPT
Jyoti Chand
Temple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation resultsTemple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation results
Krassimira Luka
Chapter wise All Notes of First year Basic Civil Engineering.pptx
Chapter wise All Notes of First year Basic Civil Engineering.pptxChapter wise All Notes of First year Basic Civil Engineering.pptx
Chapter wise All Notes of First year Basic Civil Engineering.pptx
Denish Jangid
MDP on air pollution of class 8 year 2024-2025
MDP on air pollution of class 8 year 2024-2025MDP on air pollution of class 8 year 2024-2025
MDP on air pollution of class 8 year 2024-2025
The basics of sentences session 7pptx.pptx
The basics of sentences session 7pptx.pptxThe basics of sentences session 7pptx.pptx
The basics of sentences session 7pptx.pptx
Leveraging Generative AI to Drive Nonprofit Innovation
Leveraging Generative AI to Drive Nonprofit InnovationLeveraging Generative AI to Drive Nonprofit Innovation
Leveraging Generative AI to Drive Nonprofit Innovation
skeleton System.pdf (skeleton system wow)
skeleton System.pdf (skeleton system wow)skeleton System.pdf (skeleton system wow)
skeleton System.pdf (skeleton system wow)
Mohammad Al-Dhahabi
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
How Barcodes Can Be Leveraged Within Odoo 17
How Barcodes Can Be Leveraged Within Odoo 17How Barcodes Can Be Leveraged Within Odoo 17
How Barcodes Can Be Leveraged Within Odoo 17
Celine George
Electric Fetus - Record Store Scavenger Hunt
Electric Fetus - Record Store Scavenger HuntElectric Fetus - Record Store Scavenger Hunt
Electric Fetus - Record Store Scavenger Hunt
Elevate Your Nonprofit's Online Presence_ A Guide to Effective SEO Strategies...
Elevate Your Nonprofit's Online Presence_ A Guide to Effective SEO Strategies...Elevate Your Nonprofit's Online Presence_ A Guide to Effective SEO Strategies...
Elevate Your Nonprofit's Online Presence_ A Guide to Effective SEO Strategies...
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.pptLevel 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Henry Hollis
How to deliver Powerpoint Presentations.pptx
How to deliver Powerpoint  Presentations.pptxHow to deliver Powerpoint  Presentations.pptx
How to deliver Powerpoint Presentations.pptx

Recently uploaded (20)

Walmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdfWalmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdf
BBR 2024 Summer Sessions Interview Training
BBR  2024 Summer Sessions Interview TrainingBBR  2024 Summer Sessions Interview Training
BBR 2024 Summer Sessions Interview Training
Traditional Musical Instruments of Arunachal Pradesh and Uttar Pradesh - RAYH...
Traditional Musical Instruments of Arunachal Pradesh and Uttar Pradesh - RAYH...Traditional Musical Instruments of Arunachal Pradesh and Uttar Pradesh - RAYH...
Traditional Musical Instruments of Arunachal Pradesh and Uttar Pradesh - RAYH...
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
Wound healing PPT
Wound healing PPTWound healing PPT
Wound healing PPT
Temple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation resultsTemple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation results
Chapter wise All Notes of First year Basic Civil Engineering.pptx
Chapter wise All Notes of First year Basic Civil Engineering.pptxChapter wise All Notes of First year Basic Civil Engineering.pptx
Chapter wise All Notes of First year Basic Civil Engineering.pptx
MDP on air pollution of class 8 year 2024-2025
MDP on air pollution of class 8 year 2024-2025MDP on air pollution of class 8 year 2024-2025
MDP on air pollution of class 8 year 2024-2025
The basics of sentences session 7pptx.pptx
The basics of sentences session 7pptx.pptxThe basics of sentences session 7pptx.pptx
The basics of sentences session 7pptx.pptx
Leveraging Generative AI to Drive Nonprofit Innovation
Leveraging Generative AI to Drive Nonprofit InnovationLeveraging Generative AI to Drive Nonprofit Innovation
Leveraging Generative AI to Drive Nonprofit Innovation
skeleton System.pdf (skeleton system wow)
skeleton System.pdf (skeleton system wow)skeleton System.pdf (skeleton system wow)
skeleton System.pdf (skeleton system wow)
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
How Barcodes Can Be Leveraged Within Odoo 17
How Barcodes Can Be Leveraged Within Odoo 17How Barcodes Can Be Leveraged Within Odoo 17
How Barcodes Can Be Leveraged Within Odoo 17
Electric Fetus - Record Store Scavenger Hunt
Electric Fetus - Record Store Scavenger HuntElectric Fetus - Record Store Scavenger Hunt
Electric Fetus - Record Store Scavenger Hunt
Elevate Your Nonprofit's Online Presence_ A Guide to Effective SEO Strategies...
Elevate Your Nonprofit's Online Presence_ A Guide to Effective SEO Strategies...Elevate Your Nonprofit's Online Presence_ A Guide to Effective SEO Strategies...
Elevate Your Nonprofit's Online Presence_ A Guide to Effective SEO Strategies...
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.pptLevel 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
How to deliver Powerpoint Presentations.pptx
How to deliver Powerpoint  Presentations.pptxHow to deliver Powerpoint  Presentations.pptx
How to deliver Powerpoint Presentations.pptx

M12S06 - Will Technology-Assisted Predictive Modeling and Auto-Classification End the 'End User' Burden in Records Management?

  • 1. Cohasset Associates, Inc.
    Will Technology-Assisted Predictive Modeling and Auto-Classification End the ‘End-User’ Burden in Records Management?
    2012 Managing Electronic Records Conference, Chicago, IL, May 7, 2012
    Jason R. Baron, Esq., Director of Litigation, Office of General Counsel, National Archives and Records Administration
    Dave Lewis, Ph.D., David D. Lewis Consulting, LLC, Chicago, IL
    A New Era of Government: “[P]roper records management is the backbone of open Government.” President Obama’s Memorandum dated November 28, 2011, re “Managing Government Records”
  • 2. Reality: The era of Big Data has just begun….
    Lehman Brothers Investigation: 350 billion page universe (3 petabytes); Examiner narrowed collection by selecting key custodians, using dozens of Boolean searches; reviewed 5 million docs (40 million pages) using 70 contract attorneys. Source: Report of Anton R. Valukas, Examiner, In re Lehman Brothers Holdings Inc., et al., Chapter 11 Case No. 08-13555 (U.S. Bankruptcy Ct. S.D.N.Y. March 11, 2010), Vol. 7, Appx. 5.
    Process Optimization Problem 1: The transactional toll of user-based recordkeeping schemes (“as is” RM) …. and the need for better, automated solutions ….
  • 3. Impact of Technology on E-Records Management: Snapshot 2012 (“As is”): A universe of proprietary products exists in the marketplace: document management and records management applications (RMAs); DoD 5015.2 version 3 compliant products; however, scalability issues exist; agencies must prepare to confront significant front-end process issues when transitioning to electronic recordkeeping; records schedule simplification is key.
    RM wish list for 2012….: RM’s “easy button”: the elusive goal of zero extra keystrokes to comply with RM requirements (capture); a technology app that automatically tags records in compliance with RM policies and practices (categorize); supervised learning RM with minimal records officer or end user involvement (learn); rule-based and role-based RM; advanced search.
    Electronic Archiving As The First Step: What is it? 100% snapshot of (typically) email, plus in some cases other selected ESI applications. How does it differ from an RMA? Goal is preservation of evidence, not records management per se. NARA Bulletin 2008-05.
  • 4. A Possible Path Forward?: Email archiving in short term, synced to existing proprietary software on email system; designation of key senior officials as creating permanent records, consistent with existing records schedules; additional designations of permanent records by agency component; “smart” filters/categorical rules built in based on content, to the extent feasible; by default, records fall into designated temporary record buckets and are disposed of under existing records schedules.
    A pyramid approach combines disposition policy with automated tools to bring FRA email under records management, preservation, and access. [Figure: pyramid with a slider marking the “set-point”; legend labels: permanent, top officials, temporary, staff and support.] The position of the “set-point” for email capture depends on policy and resources: setting it higher allows use of tools now available to get 100% of email at lower volumes; setting it lower means more records will be captured and smarter tools are needed to distinguish and disposition temporary and non-record email. Implementing an email archiving policy is feasible now, since tools are readily available to capture 100% of email traffic at the individual or organizational level, in formats that can be archived.
  • 5. How To Avoid A Train Wreck With Email Archiving…. Capture E-mail But Utilize Records Management!
    Functional Requirements for Categorization Products in the Federal workplace: Ease of use; scalability; archiving in native formats; metadata preservation; seamless integration with existing software apps; versioning; compatibility with big bucket records schedules; advanced search capabilities; ease of training / machine learning using records officers or end users; cost.
    Process Optimization Problem 2: The Coming Age of Dark Archives (and the inability to provide access)
  • 6. Emerging New Strategies: “Predictive Analytics”: Improved review and case assessment: cluster docs through use of software with minimal human intervention at front end to code “seeded” data set. (Slide adapted from Gartner Conference, June 23, 2010, Washington, D.C.)
    Language Processing Technologies: [Diagram listing: Information Retrieval (1. Retrieval / Search; 2. Classification), Question Answering, Summarization, Entity Recognition, Information Extraction, Machine Translation, Natural Language Processing]
    Text Classification: Deciding which of several groups a text belongs to; crudest form of language understanding... but often can be automated with high accuracy.
  • 7. Why Classify?: Reduce infinite variety of text... to a finite set of classes... and specify an action for every possible input.
    Other Advantages of Text Classification: Supervised learning: classifiers (rules) can be learned by imitating manual classifications; straightforward numerical measures of quality (recall: 85% +/- 4%, precision: 75% +/- 3%); objective reason why a decision was made (classification rule).
    Variations on Classification: Binary vs. multiclass; hierarchical; probabilistic (e.g., 83% / 17%); graded / ordered / fuzzy.
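The recall and precision figures on the Text Classification slide can be illustrated with a minimal Python sketch. The ten labels below are invented for illustration (True means "is a record"), and the margin of error uses a simple normal approximation, which is an assumption here; the slides do not say how the "+/-" values were obtained.

```python
import math

def precision_recall(manual, predicted):
    """Compare software labels with manual labels (True = 'is a record')."""
    pairs = list(zip(manual, predicted))
    tp = sum(1 for m, p in pairs if m and p)        # correctly tagged records
    fp = sum(1 for m, p in pairs if not m and p)    # tagged, but not records
    fn = sum(1 for m, p in pairs if m and not p)    # records the software missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def margin(p, n, z=1.96):
    """Normal-approximation 95% margin of error for a proportion p over n items."""
    return z * math.sqrt(p * (1 - p) / n) if n else 0.0

# Hypothetical sample: manual coding by a records officer vs. classifier output.
manual    = [True, True, True, True,  False, False, False, False, True, False]
predicted = [True, True, True, False, True,  False, False, False, True, False]

precision, recall = precision_recall(manual, predicted)
print(f"precision: {precision:.0%} +/- {margin(precision, 5):.0%}")
print(f"recall:    {recall:.0%} +/- {margin(recall, 5):.0%}")
```

With a sample this small the margin is very wide; narrow margins like those quoted on the slide require a much larger manually coded sample.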
  • 8. Defining Sets of Classes: Tradeoff among: ideal classes to implement policy; classes you can teach people to assign; classes you can teach software to assign. Be skeptical of automatic discovery of classes.
    Text Retrieval Systems: AKA search engines, semi-structured databases, text databases, etc.
    Classification vs. Search: autonomous vs. interactive; long term vs. transitory; organizational vs. personal; structured vs. independent; ? ? ?
  • 9. Some Distinctions Among Search Approaches: Exact Match vs. Ranked Retrieval vs. Browsing; "Concepts" vs. "Keywords"; Text Representations; Matching Aids.
    Exact Match Search: Query specifies conditions document must meet, e.g. budget AND Knoxville AND (revised OR preliminary). Variants: Boolean; SQL; Faceted. Often (ambiguously) called "keyword" search.
    A Faceted Search Interface [screenshot not reproduced]
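The exact-match query on the slide above can be sketched in a few lines of Python. The four documents are invented for illustration, and the word-set matching is a deliberately simplified assumption, not how any particular search product works.

```python
import re

def terms(text):
    """Lowercase a document and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches(text):
    """budget AND Knoxville AND (revised OR preliminary)"""
    t = terms(text)
    return "budget" in t and "knoxville" in t and ("revised" in t or "preliminary" in t)

# Hypothetical documents, keyed by an illustrative id.
docs = {
    "d1": "Revised budget for the Knoxville office",
    "d2": "Preliminary Knoxville budget, draft 2",
    "d3": "Final budget for Knoxville",
    "d4": "Revised budget for Nashville",
}

hits = [doc_id for doc_id, text in docs.items() if matches(text)]
print(hits)
```

A document either satisfies every condition or is excluded; there is no notion of a partial or stronger match, which is the main contrast with ranked retrieval.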
  • 10. Ranked Retrieval: Query specifies important attributes of desired documents; system statistically weights those attributes; results returned in order of strength of match.
    Statistical Evidence in Ranked Retrieval: Corpus statistics: word (and metadata) counts. Unsupervised learning: clustering, LSI/LSA, etc.; finds (maybe useless) patterns. Supervised learning: aka "relevance feedback"; learn indicators of user interest.
    Browsing: Hierarchies; networks; clusters; spaces / maps / dimensions; make great pictures / demos; unclear if useful for finding information.
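As one possible illustration of ranked retrieval driven by corpus statistics, the sketch below scores documents with a basic TF-IDF weighting. TF-IDF is swapped in here as a familiar example; the slides do not name a specific weighting scheme, and the three documents are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(query, docs):
    """Return document ids ordered by a simple TF-IDF match score (highest first)."""
    n = len(docs)
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for toks in tokenized.values():
        df.update(set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        # Rare words (low document frequency) get a higher weight.
        scores[doc_id] = sum(tf[q] * math.log(n / df[q])
                             for q in tokenize(query) if q in df)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical corpus.
docs = {
    "a": "budget memo revised budget for Knoxville",
    "b": "Knoxville travel memo",
    "c": "holiday party schedule",
}
print(rank("revised Knoxville budget", docs))
```

Unlike the exact-match query, document "b" is still returned, only lower in the list, because it matches one of the weighted query terms.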
  • 11. Visual Analysis Examples (presentation by Dr. Victoria Lemieux, Univ. British Columbia, at Society of American Archivists Annual Mtg. 2010, Washington, D.C.). With acknowledgments to Jeffrey Heer, Exploring Enron; Adam Perer, Contrasting Portraits; and Fernanda Viegas, Email Conversations.
  • 12. What Evidence Can The Search Software Use?: Words, phrases, etc.; manually assigned categories; metadata (author, organization, creation date, change date, access date, length, file type,...); contextual information (links, attachments,...).
    What Resources Aid Matching?: Linguistic analysis, at word level or higher; clusters / spaces / ...; thesauri / semantic nets / concept maps / ... Suited to your task? Modifiable? How is text determined to belong to category?
    Concepts v. Keywords (Supreme Court of Information Retrieval, Case No. 1-tfidf-0-2902, 2009): Search software marketing: Them = keyword search = bad; Us = concept search = good. Reality: Both terms have referred to dozens of different technologies... including some of the same ones! Conceptual search is an aspiration, not a technology.
  • 13. Example of Boolean search string from U.S. v. Philip Morris: (((master settlement agreement OR msa) AND NOT (medical savings account OR metropolitan standard area)) OR s. 1415 OR (ets AND NOT educational testing service) OR (liggett AND NOT sharon a. liggett) OR atco OR lorillard OR (pmi AND NOT presidential management intern) OR pm usa OR rjr OR (b&w AND NOT photo*) OR phillip morris OR batco OR ftc test method OR star scientific OR vector group OR joe camel OR (marlboro AND NOT upper marlboro)) AND NOT (tobacco* OR cigarette* OR smoking OR tar OR nicotine OR smokeless OR synar amendment OR philip morris OR r.j. reynolds OR ("brown and williamson") OR ("brown & williamson") OR bat industries OR liggett group)
    U.S. v. Philip Morris E-mail Winnowing Process: 20 million email records; 200,000 hits based on keyword terms used (1%); 100,000 relevant emails; 80,000 produced to opposing party; 20,000 placed on privilege logs. A PROBLEM: only a handful entered as exhibits at trial. A BIGGER PROBLEM: the 1% figure does not scale.
    Judicial endorsement of predictive analytics in document review by Judge Peck in Da Silva Moore v. Publicis Groupe (S.D.N.Y. Feb. 24, 2012): “This opinion appears to be the first in which a Court has approved of the use of computer-assisted review. . . . What the Bar should take away from this Opinion is that computer-assisted review is an available tool and should be seriously considered for use in large-data-volume cases where it may save the producing party (or both parties) significant amounts of legal fees in document review. Counsel no longer have to worry about being the ‘first’ or ‘guinea pig’ for judicial acceptance of computer-assisted review . . . Computer-assisted review can now be considered judicially-approved for use in appropriate cases.”
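The arithmetic behind the winnowing slide, and behind the remark that the 1% figure does not scale, can be worked through in a short Python sketch. The stage counts come from the slide; the one-billion-email collection is a hypothetical value chosen only to show what a constant 1% hit rate would imply.

```python
# Stage counts as given on the U.S. v. Philip Morris winnowing slide.
stages = [
    ("email records", 20_000_000),
    ("keyword hits", 200_000),
    ("relevant emails", 100_000),
    ("produced to opposing party", 80_000),
    ("placed on privilege logs", 20_000),
]

total = stages[0][1]
shares = {name: count / total for name, count in stages}
for name, count in stages:
    print(f"{name:28s} {count:>12,d}  {shares[name]:.2%} of collection")

hit_rate = shares["keyword hits"]  # the 1% figure from the slide

# Hypothetical larger collection (an assumption, not a figure from the slides).
hypothetical_collection = 1_000_000_000
projected_hits = hypothetical_collection * 200_000 // 20_000_000
print(f"At the same hit rate, {hypothetical_collection:,d} emails "
      f"would yield {projected_hits:,d} hits to review.")
```

Even a hit rate that looks small leaves a review set in the millions once the collection grows by a couple of orders of magnitude.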
  • 14. Social Networking/Links Analysis Example: From Marc Smith, posted on Flickr under Creative Commons License.
    Judicial second guessing of failure to use e-search capabilities: Capitol Records v. MP3 Tunes, 261 F.R.D. 44 (S.D.N.Y. 2009): “In [a prior case] the Court notes its dismay that the party opposing discovery of its ESI had organized its files in a manner which seemed to serve no purpose other than ‘to discourage audits. . .’ Similarly, in this case, [the party] host[ed] no ediscovery software on their servers and apparently are unable to conduct centralized email searches of groups of users without downloading them to a separate file and relying on the services of an outside vendor.”
    Judicial second guessing of failure to use e-search capabilities: Capitol Records v. MP3 Tunes (cont’d): Court went on to add: “The day undoubtedly will come when burden arguments based on a large organization’s lack of internal ediscovery software will be received about as well as the contention that a party should be spared from retrieving paper documents because it had filed them sequentially, but in no apparent groupings, in an effort to avoid the added expense of file folders or indices.”
  • 15. Problem 3: Innovative Thinking
    The records management world of tomorrow….
    References: Background Law Review Referencing Autocategorization & Advanced Search: J. Baron, “Law in the Age of Exabytes: Some Further Thoughts on ‘Information Inflation’ and Current Issues in E-Discovery Search,” 17 Richmond J. Law & Technology (2011). Latest “Predictive Coding” Case Law to follow in blogs online: Da Silva Moore v. Publicis Groupe & MSL Group, 11 Civ. 1279 (S.D.N.Y.) (Peck, M.J.) (Opinion dated Feb. 24, 2012); Kleen Products, LLC v. Packaging Corp. of America, 10 C 5711 (N.D. Ill.) (Nolan, M.J.).
  • 16. Jason R. Baron, Director of Litigation, Office of General Counsel, National Archives and Records Administration, (301) 837-1499
    Dave Lewis, Ph.D., David D. Lewis Consulting, LLC, Chicago, IL