M12S06 - Will Technology-Assisted Predictive Modeling and Auto-Classification End the 'End User' Burden in Records Management?

From the MER Conference 2012

Speakers: Jason R. Baron, Esq., and Dave Lewis, Ph.D.

2012 is the year information professionals will make great strides in using automation (in the form of "predictive" and "technology-assisted" search, filtering, and auto-classification) to achieve efficiencies and cut costs in records management as well as in legal settings.

The strategic use of these new methods is essential given the massive, exponential growth of electronically stored information held as records within corporate networks and repositories.

This session addresses the latest technological developments from two perspectives:

- A longtime advocate of smart technology in the public recordkeeping sector, and
- A leading information scientist.

The session includes a state-of-the-art overview of the latest developments in technology-assisted review, with an emphasis on how these technologies can and will enhance electronic records management by helping to end the era of excessive reliance on end-user RM.

You will learn:

- What technology-assisted review and predictive analytics are all about, and how advanced search, filtering, and auto-classification can be used as part of a defensible electronic records management program.
- How these technologies also add value to overall corporate information governance.


Transcript

  • 1. Will Technology-Assisted Predictive Modeling and Auto-Classification End the ‘End-User’ Burden in Records Management?
       2012 Managing Electronic Records Conference, Chicago, IL, May 7, 2012
       Jason R. Baron, Esq., Director of Litigation, Office of General Counsel, National Archives and Records Administration
       Dave Lewis, Ph.D., David D. Lewis Consulting, LLC, Chicago, IL

       A New Era of Government
       “[P]roper records management is the backbone of open Government.”
       President Obama’s Memorandum dated November 28, 2011, re “Managing Government Records,” http://www.whitehouse.gov/the-press-office/2011/11/28/presidential-memorandum-managing-government-records
  • 2. Reality: The era of Big Data has just begun....
       Lehman Brothers Investigation:
       - 350 billion page universe (3 petabytes)
       - Examiner narrowed the collection by selecting key custodians and using dozens of Boolean searches
       - Reviewed 5 million documents (40 million pages, using 70 contract attorneys)
       Source: Report of Anton R. Valukas, Examiner, In re Lehman Brothers Holdings Inc., et al., Chapter 11 Case No. 08-13555 (U.S. Bankruptcy Ct. S.D.N.Y. March 11, 2010), Vol. 7, Appx. 5, at http://lehmanreport.jenner.com/.

       Process Optimization Problem 1: The transactional toll of user-based recordkeeping schemes (“as is” RM) .... and the need for better, automated solutions ....
  • 3. Impact of Technology on E-Records Management: Snapshot 2012 (“As is”)
       - A universe of proprietary products exists in the marketplace: document management and records management applications (RMAs)
       - DoD 5015.2 version 3 compliant products
       - However, scalability issues exist
       - Agencies must prepare to confront significant front-end process issues when transitioning to electronic recordkeeping
       - Records schedule simplification is key

       RM wish list for 2012....
       - RM’s “easy button”: the elusive goal of zero extra keystrokes to comply with RM requirements (capture)
       - A technology app that automatically tags records in compliance with RM policies and practices (categorize)
       - Supervised learning RM with minimal records officer or end user involvement (learn)
       - Rule-based and role-based RM
       - Advanced search

       Electronic Archiving As The First Step
       - What is it? A 100% snapshot of (typically) email, plus in some cases other selected ESI applications
       - How does it differ from an RMA? The goal is preservation of evidence, not records management per se
       - NARA Bulletin 2008-05
  • 4. A Possible Path Forward?
       - Email archiving in the short term, synced to existing proprietary software on the email system
       - Designation of key senior officials as creating permanent records, consistent with existing records schedules
       - Additional designations of permanent records by agency component
       - “Smart” filters/categorical rules built in based on content, to the extent feasible
       - Default: records go into designated temporary record buckets, disposed of under existing records schedules

       [Pyramid diagram: top of pyramid = permanent / top officials; lower levels = temporary / staff and support officials; a slider marks the email-capture “set-point.”]
       A pyramid approach combines disposition policy with automated tools to bring FRA email under records management, preservation, and access. The position of the “set-point” for email capture depends on policy and resources: setting it higher allows use of tools now available to get 100% of email at lower volumes;* setting it lower means more records will be captured and smarter tools are needed to distinguish and disposition temporary and non-record material. Implementing an email archiving policy is feasible now, since tools are readily available to capture 100% of email traffic at the individual or organizational level, in formats that can be archived.
  • 5. How To Avoid A Train Wreck With Email Archiving.... Capture E-mail But Utilize Records Management!

       Functional Requirements for Categorization Products in the Federal Workplace
       - Ease of use
       - Scalability
       - Archiving in native formats
       - Metadata preservation
       - Seamless integration with existing software apps
       - Versioning
       - Compatibility with big bucket records schedules
       - Advanced search capabilities
       - Ease of training / machine learning using records officers or end users
       - Cost

       Process Optimization Problem 2: The Coming Age of Dark Archives (and the inability to provide access)
  • 6. Emerging New Strategies: “Predictive Analytics”
       Improved review and case assessment: cluster documents through the use of software, with minimal human intervention at the front end, to code a “seeded” data set. (Slide adapted from Gartner Conference, June 23, 2010, Washington, D.C.)

       Language Processing Technologies
       Natural Language Processing includes: 1. Retrieval / Search; 2. Information Classification; as well as Question Answering, Summarization, Entity Recognition, Information Extraction, and Machine Translation.

       Text Classification
       - Deciding which of several groups a text belongs to
       - Crudest form of language understanding...
       - ...but often can be automated with high accuracy
  • 7. Why Classify? ...to reduce an infinite variety of text to a finite set of classes, and to specify an action for every possible input.

       Other Advantages of Text Classification
       - Supervised learning: classifiers (rules) can be learned by imitating manual classifications
       - Straightforward numerical measures of quality, e.g., recall: 85% +/- 4%; precision: 75% +/- 3%
       - Objective reason why a decision was made (the classification rule)
       (A minimal supervised-classification sketch appears after this transcript.)

       Variations on Classification
       - Binary vs. multiclass
       - Hierarchical
       - Probabilistic (e.g., 83% / 17%)
       - Graded / ordered / fuzzy
  • 8. Defining Sets of Classes
       - Tradeoff among: the ideal classes to implement policy, the classes you can teach people to assign, and the classes you can teach software to assign
       - Be skeptical of automatic discovery of classes

       Text Retrieval Systems
       - AKA search engines, semi-structured databases, text databases, etc.

       Classification vs. Search
       - Classification: autonomous, long term, organizational, structured
       - Search: interactive, transitory, personal, independent
  • 9. Some Distinctions Among Search Approaches
       - Exact Match vs. Ranked Retrieval vs. Browsing
       - “Concepts” vs. “Keywords”
       - Text Representations
       - Matching Aids

       Exact Match Search
       - Query specifies conditions a document must meet: budget AND Knoxville AND (revised OR preliminary)
       - Variants: Boolean, SQL, Faceted
       - Often (ambiguously) called “keyword” search
       (A small Boolean-matching sketch of this query appears after this transcript.)

       A Faceted Search Interface [screenshot]
  • 10. Ranked Retrieval
       - Query specifies important attributes of desired documents
       - System statistically weights those attributes
       - Results returned in order of strength of match

       Statistical Evidence in Ranked Retrieval
       - Corpus statistics: word (and metadata) counts
       - Unsupervised learning: clustering, LSI/LSA, etc.; finds (maybe useless) patterns
       - Supervised learning: aka "relevance feedback"; learns indicators of user interest
       (A TF-IDF ranking sketch appears after this transcript.)

       Browsing
       - Hierarchies, networks, clusters, spaces / maps / dimensions
       - Make great pictures / demos
       - Unclear if useful for finding information
  • 11. Visual Analysis Examples
       (Presentation by Dr. Victoria Lemieux, Univ. of British Columbia, at the Society of American Archivists Annual Mtg. 2010, Washington, D.C.)
       With acknowledgments to Jeffrey Heer, Exploring Enron, http://hci.stanford.edu/jheer/projects/enron/; Adam Perer, Contrasting Portraits, http://hcil.cs.umd.edu/trs/2006-08/2006-08.pdf; and Fernanda Viegas, Email Conversations, http://fernandaviegas.com/email.html
  • 12. What Evidence Can The Search Software Use?
       - Words, phrases, etc.
       - Manually assigned categories
       - Metadata: author, organization, creation date, change date, access date, length, file type, ...
       - Contextual information (links, attachments, ...)
       (A short sketch combining word and metadata evidence appears after this transcript.)

       What Resources Aid Matching?
       - Linguistic analysis, at the word level or higher
       - Clusters / spaces / ...
       - Thesauri / semantic nets / concept maps / ...
       - Suited to your task? Modifiable? How is text determined to belong to a category?

       Concepts v. Keywords (Supreme Court of Information Retrieval, Case No. 1-tfidf-0-2902, 2009)
       - Search software marketing: Them = keyword search = bad; Us = concept search = good
       - Reality: both terms have referred to dozens of different technologies... including some of the same ones!
       - Conceptual search is an aspiration, not a technology
  • 13. Example of Boolean search string from U.S. v. Philip Morris
       (((master settlement agreement OR msa) AND NOT (medical savings account OR metropolitan standard area)) OR s. 1415 OR (ets AND NOT educational testing service) OR (liggett AND NOT sharon a. liggett) OR atco OR lorillard OR (pmi AND NOT presidential management intern) OR pm usa OR rjr OR (b&w AND NOT photo*) OR phillip morris OR batco OR ftc test method OR star scientific OR vector group OR joe camel OR (marlboro AND NOT upper marlboro)) AND NOT (tobacco* OR cigarette* OR smoking OR tar OR nicotine OR smokeless OR synar amendment OR philip morris OR r.j. reynolds OR ("brown and williamson") OR ("brown & williamson") OR bat industries OR liggett group)

       U.S. v. Philip Morris E-mail Winnowing Process
       20 million email records -> 200,000 hits based on keyword terms used (1%) -> 100,000 relevant emails -> 80,000 produced to opposing party -> 20,000 placed on privilege logs
       - A PROBLEM: only a handful entered as exhibits at trial
       - A BIGGER PROBLEM: the 1% figure does not scale

       Judicial endorsement of predictive analytics in document review by Judge Peck in Da Silva Moore v. Publicis Groupe (S.D.N.Y. Feb. 24, 2012):
       “This opinion appears to be the first in which a Court has approved of the use of computer-assisted review. . . . What the Bar should take away from this Opinion is that computer-assisted review is an available tool and should be seriously considered for use in large-data-volume cases where it may save the producing party (or both parties) significant amounts of legal fees in document review. Counsel no longer have to worry about being the ‘first’ or ‘guinea pig’ for judicial acceptance of computer-assisted review . . . Computer-assisted review can now be considered judicially-approved for use in appropriate cases.”
  • 14. Social Networking / Links Analysis Example
       (From Marc Smith, posted on Flickr under a Creative Commons license)

       Judicial second-guessing of failure to use e-search capabilities: Capitol Records v. MP3 Tunes, 261 F.R.D. 44 (S.D.N.Y. 2009)
       “In [a prior case] the Court notes its dismay that the party opposing discovery of its ESI had organized its files in a manner which seemed to serve no purpose other than ‘to discourage audits. . .’ Similarly, in this case, [the party] host[ed] no ediscovery software on their servers and apparently are unable to conduct centralized email searches of groups of users without downloading them to a separate file and relying on the services of an outside vendor.”

       Judicial second-guessing of failure to use e-search capabilities: Capitol Records v. MP3 Tunes (con’t)
       The Court went on to add: “The day will undoubtedly come when burden arguments based on a large organization’s lack of internal ediscovery software will be received about as well as the contention that a party should be spared from retrieving paper documents because it had filed them sequentially, but in no apparent groupings, in an effort to avoid the added expense of file folders or indices.”
  • 15. Problem 3: Innovative Thinking
       The records management world of tomorrow....

       References
       Background law review referencing autocategorization and advanced search: J. Baron, “Law in the Age of Exabytes: Some Further Thoughts on ‘Information Inflation’ and Current Issues in E-Discovery Search,” 17 Richmond J. Law & Technology (2011), see http://law.richmond.edu

       Latest “predictive coding” case law to follow in blogs online:
       - Da Silva Moore v. Publicis Groupe & MSL Group, 11 Civ. 1279 (S.D.N.Y.) (Peck, M.J.) (Opinion dated Feb. 24, 2012)
       - Kleen Products, LLC v. Packaging Corp. of America, 10 C 5711 (N.D. Ill.) (Nolan, M.J.)
  • 16. Jason R. Baron
       Director of Litigation, Office of General Counsel, National Archives and Records Administration
       (301) 837-1499
       Email: jason.baron@nara.gov

       Dave Lewis, Ph.D.
       David D. Lewis Consulting, LLC, Chicago, IL
       Email: consult@DavidDLewis.com
       http://www.DavidDLewis.com
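
Illustrative code sketches. The four short sketches below are not part of the presentation; they illustrate, under stated assumptions, techniques the slides describe. First, a minimal sketch of the supervised text classification discussed on pages 6 and 7 above: a classifier is learned by imitating manual classifications, and quality is reported as precision and recall. It assumes scikit-learn is installed; the tiny "record" / "non-record" documents and labels are invented for illustration.

    # Supervised text classification sketch (pages 6-7 above): learn a classifier by
    # imitating manual classifications, then report precision and recall.
    # Assumes scikit-learn is installed; all documents and labels below are invented.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score
    from sklearn.pipeline import make_pipeline

    # Manually classified training examples (a small "seed set").
    train_texts = [
        "Attached is the signed contract for the Knoxville facility.",
        "Final budget figures for FY2012, please retain with the program records.",
        "Lunch at noon on Friday?",
        "Fantasy football picks are due tonight.",
    ]
    train_labels = [1, 1, 0, 0]  # 1 = record, 0 = non-record

    # The classifier imitates the manual classifications.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(train_texts, train_labels)

    # Held-out examples give the numerical quality measures mentioned on page 7.
    test_texts = [
        "Revised preliminary budget for the Knoxville office, for the official file.",
        "Anyone want coffee?",
    ]
    test_labels = [1, 0]
    predicted = model.predict(test_texts)
    print("precision:", precision_score(test_labels, predicted))
    print("recall:   ", recall_score(test_labels, predicted))

    # Probabilistic variation (page 7): per-class probabilities instead of a hard label.
    print("P(record):", model.predict_proba(test_texts)[:, 1])

With only four training examples the scores are meaningless; the point is the workflow of labeling, learning, and measuring.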
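Next, a small sketch of the exact-match (Boolean) search shown on page 9, using the query budget AND Knoxville AND (revised OR preliminary). Plain Python; the three sample documents are invented, and a real system would evaluate the query against an inverted index rather than scanning every document.

    # Exact-match (Boolean) search sketch for the query on page 9:
    #   budget AND Knoxville AND (revised OR preliminary)
    import re

    docs = {
        1: "Revised budget for the Knoxville facility.",
        2: "Preliminary budget notes from the Chicago office.",
        3: "Knoxville staffing memo; preliminary budget attached.",
    }

    def terms(text):
        # Crude tokenizer: lowercase word tokens only.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def matches(text):
        t = terms(text)
        return "budget" in t and "knoxville" in t and ("revised" in t or "preliminary" in t)

    hits = [doc_id for doc_id, text in docs.items() if matches(text)]
    print(hits)  # -> [1, 3]; document 2 fails the Knoxville condition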
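Next, a sketch of the ranked retrieval described on page 10: query terms are statistically weighted (TF-IDF is used here as one common choice) and documents come back in order of strength of match. Assumes scikit-learn; the corpus and query are invented.

    # Ranked retrieval sketch (page 10): query terms are statistically weighted
    # (TF-IDF here) and documents are returned in order of strength of match.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "Revised budget for the Knoxville facility",
        "Preliminary budget notes from the Chicago office",
        "Knoxville staffing memo with preliminary budget attached",
        "Cafeteria menu for the week",
    ]

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(docs)

    # The query lists important attributes rather than strict conditions.
    query_vector = vectorizer.transform(["revised preliminary Knoxville budget"])

    # Score every document and list them by descending strength of match.
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    for rank, idx in enumerate(scores.argsort()[::-1], start=1):
        print(f"{rank}. score={scores[idx]:.3f}  {docs[idx]}")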
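Finally, a sketch of the point on page 12 that search software can combine word evidence with metadata evidence: a text condition is intersected here with filters on creation date and file type. All sample documents and field names are invented.

    # Combining word evidence with metadata evidence (page 12): a text condition
    # intersected with filters on creation date and file type.
    from datetime import date

    docs = [
        {"id": 1, "text": "Knoxville budget revision", "author": "baron",
         "created": date(2012, 3, 1), "file_type": "msg"},
        {"id": 2, "text": "Holiday party photos", "author": "lewis",
         "created": date(2011, 12, 20), "file_type": "jpg"},
        {"id": 3, "text": "Preliminary Knoxville budget", "author": "lewis",
         "created": date(2012, 4, 15), "file_type": "docx"},
    ]

    hits = [
        d["id"] for d in docs
        if "budget" in d["text"].lower()          # word evidence
        and d["created"] >= date(2012, 1, 1)      # metadata: creation date
        and d["file_type"] in {"msg", "docx"}     # metadata: file type
    ]
    print(hits)  # -> [1, 3]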