Beyond Human Capacity: Using Analytics to Scale Your Everyday Information Migration and Interface Activities
Steve Clark
Raytheon Company
Introduction
•  We've all been there:
–  Loads of file cabinets with paper records
–  Organized boxes of paper
–  Not-so-organized boxes of paper
•  Glad to say help is on the way!
Topics
•  Background – Problems I'm trying to solve
•  Technology – Insight into how the technology works
•  Approach – Approach to how I used the tool
•  Realizations – Aha moments during the process
•  Analysis Results – Effectiveness of the tool
•  Recommendations – Lessons to improve results
•  Takeaways – Summary of steps to use if using the tool
•  Extensions – Other potential uses of the tool
•  Contact Info – How to contact me
Background
•  In Records Management, a number of challenges arise that complicate the management of records. Specifically:
1.  Repositories (legacy, archives, targeted for migration, etc.) that contain both records and non-records. I'm interested in the records only, so the repository files need to be reviewed by Subject Matter Experts (SMEs) in order to identify the records: extremely time consuming!
2.  Imaging of existing paper records requires metadata in order to manage the record: capturing the metadata is a manual task and is also labor intensive!
Technology
•  Auto Classification is a semantic technology that can be used to identify and categorize documents.
•  It is a machine learning application, which means it is trained to identify documents, primarily through the use of relational keywords.
•  Training is an iterative process starting with clues for record identification. Each iteration generally increases classification accuracy.
How the Tool Works

Level 1 Terms              Clue                       Clue Type   Score Mod
Statement of Work (SOW)    Statement of Work (SOW)    Standard            0
                           Doc Name*=*statement*      Metadata           35
                           Doc Name*=*work*           Metadata           20
                           Doc Name*=*sow*            Metadata           50
                           Doc Name*=*closure*        Metadata          -10
                           Doc Name*=*supplier*       Metadata          -15
                           Doc Name*=*ssow*           Metadata          -75
                           Doc Name*=*jsow*           Metadata          -75

•  The Level 1 Term is the item I'm trying to identify; the clues are what identify it.
•  Each hit accumulates its score modifier into the item's score; an accumulated score of 50 gets the item classified (a minimal scoring sketch follows).
•  Negative scores can be used to push look-alike items away from the threshold.
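A minimal sketch (in Python, not the vendor tool's own code) of the scoring logic above; the clue patterns come from the table, everything else is illustrative:

    import fnmatch

    # (wildcard pattern, score modifier) pairs from the table above
    SOW_CLUES = [
        ("*statement*", 35), ("*work*", 20), ("*sow*", 50),
        ("*closure*", -10), ("*supplier*", -15),
        ("*ssow*", -75), ("*jsow*", -75),
    ]

    def score(doc_name, clues=SOW_CLUES):
        """Every clue that hits the document name accumulates its score modifier."""
        name = doc_name.lower()
        return sum(mod for pattern, mod in clues if fnmatch.fnmatch(name, pattern))

    def classify(doc_name, threshold=50):
        """An accumulated score of 50 or more gets the item classified."""
        return score(doc_name) >= threshold

    # classify("Statement of Work 123") -> True  (35 + 20 = 55)
    # classify("Supplier SOW rev B")    -> False (50 - 15 = 35)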
Approach
•  The approach was to use the auto classification technology (tool) to determine the feasibility of identifying records within a specific Engineering document repository.
•  Training the tool is an important aspect of applying the technology. It is typically a one-time (non-recurring) effort that needs tweaking on occasion. The training consists of the following:
–  Identify the potential set of records that are likely in the repository. This establishes targets for the training. In our case we established a Record Work Product List (RWPL), which identifies records within the company.
–  Since our target set was Engineering, we focused on the identification of 70 Engineering-type records.
–  Training consists of identifying key words, phrases, and relationships, and assigning weights to these items in order to classify the document.
–  Establish goals for what is good enough: the probability to identify records and the probability for non-records. My targets were 85% for records and 95% for non-records (a small evaluation sketch follows this list).
–  Select the training set. The training set should be representative of what is expected in the repository and consist of enough items to establish the targets for all items.
–  Perform the training until the targets are realized on the training set.
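The targets above can be checked with a small evaluation sketch, assuming a manually labeled training set (the names here are illustrative; the tool reports similar figures itself):

    def hit_rate(pairs):
        """Fraction of (predicted, actual) label pairs that agree."""
        return sum(p == a for p, a in pairs) / len(pairs)

    def targets_met(results, record_target=0.85, nonrecord_target=0.95):
        """results: list of (predicted, actual) labels from the labeled training set."""
        records = [(p, a) for p, a in results if a != "non-record"]
        nonrecords = [(p, a) for p, a in results if a == "non-record"]
        return (hit_rate(records) >= record_target
                and hit_rate(nonrecords) >= nonrecord_target)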
Realizations
•  The repository I started on for this application had almost a million documents (855,000).
•  My first realization was that the documents were all in a proprietary repository, and the tool was not able to directly access items in this repository without developing a connector. So analyzing documents in the repository needed customization, and I had neither the time nor the budget to develop a connector.
•  To get around this issue I had a report generated (in Excel) from the repository that provided me with three pieces of information: Document Number, Document Title, and Document Family (a set of 88 types). I was curious to see if I could use the tool to identify records based on this limited information (sketched after this list).
•  From this report I established a training set of 50,000 line items. I used a large set due to the limited information provided. Note: if using entire documents (vs. titles), a much smaller training set can be used.
•  Training took me 7 passes:
–  Set an initial set of "clues" for a set of items. After each run the results were analyzed to determine how many items were classified and the overall accuracy of the classification.
–  The item "non-record" was added after the realization that identification of non-records assists in the identification of records.
–  The goal of the first 4 passes was basically to increase the number of types of records identified, with some emphasis on accuracy.
–  The next three passes focused on the overall accuracy of the clues. Accuracy is actually more time consuming because it is a manual process: every item needs to be assessed.
–  Another realization was that many of the items were not classifiable: not enough information was contained in the data set to render a classification (e.g., sometimes the document number was just repeated in the document title field).
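A sketch of the workaround, assuming the report columns match the three fields above and reusing the score() sketch from earlier (pandas and the file name are illustrative choices, not the actual setup):

    import pandas as pd

    report = pd.read_excel("repository_report.xlsx")  # hypothetical export
    # Only the title carries signal, so score and classify on it alone.
    titles = report["Document Title"].fillna("").astype(str)
    report["score"] = titles.map(score)
    report["classified"] = report["score"] >= 50
    # Rows where the title just repeats the document number are unclassifiable.
    report["no_signal"] = titles == report["Document Number"].astype(str)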
Training Results
•  Seven runs were made on the training set of 50,505 items.
•  The classification percentage was monitored because not all of the items were considered classifiable.
•  Targets were actually achieved after the sixth run, but one additional run was made to slightly enhance accuracy.

Run date                         1/11/2017  1/13/2017  1/16/2017  1/18/2017  2/15/2017  2/28/2017  3/16/2017
# record work products                  54         57         64         66         72         70         70
# records classified                 13227      17755      27490      32404      34291      35086      34834
# non-records classified                 0       6091       6623       8621       9732      10390      10791
percentage classified                26.2%      35.2%      54.4%      64.2%      67.9%      69.5%      69.0%
non-record accuracy                  -----      -----      -----      -----      -----      94.8%      98.3%
overall classification accuracy      -----      -----      -----      -----      -----      91.0%      95.1%
Auto Classification Percentage
•  Detailed analysis was completed on the 6th run to determine the overall classification accuracy:

2/28/17 CLASSIFICATION ANALYSIS                Counts   Percent
classified correctly                            33550     66.4%
classifiable (but not yet classified)            7928     15.7%
incorrectly classified (considered fixable)      1539      3.0%
unclassifiable                                   5484     10.9%
blank - not analyzed as classifiable or not      2004      4.0%
Total                                           50505    100.0%

CLASSIFICATION ESTIMATION    Counts   Percent   Extrapolated to all 855,491 items
estimated classifiable        44202     87.5%   748719
estimated unclassifiable       6303     12.5%   106772
Total                         50505             855491
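The estimation rows are a straight extrapolation of the training-set ratio to the full repository; a quick check (small differences from the table come from rounding on the slide):

    classifiable_ratio = 44202 / 50505            # ~87.5% of the training set
    print(round(classifiable_ratio * 855491))     # ~748,700 estimated classifiable items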
Auto Classification Results
•  The auto classification training clue set was run on the remaining set of 804,954 items, consisting of 8 batch runs of ~100,000 each:

                          batch 1   batch 2   batch 3   batch 4   batch 5   batch 6   batch 7   batch 8
# record work products         70        70        70        70        70        70        70        70
# records classified        66644     66703     66445     66311     68700     69131     68016     71431
# non-records classified    21307     21109     21131     21065     21841     22044     21706     22828
unclassified                33355     33296     33554     33688     31299     30868     31983     32454
percentage classified       66.6%     66.7%     66.4%     66.3%     68.7%     69.1%     68.0%     68.0%

•  Overall, the results achieved on the larger set (67.6%) were consistent with the training set (69.0%), within -1.4%.
Recommendations
•  If you choose to use document titles for classification purposes:
–  Refrain from using just numbers or cryptic abbreviations (use standard terminology).
–  Standardize or eliminate certain shorthand-type notations (e.g., appr or appv for approved); a small normalization sketch follows this list.
–  Be more rigorous when selecting the document type (inaccuracies here cause classification errors).
–  Ensure spelling is correct.
–  If document numbering could be standardized, this would also assist accuracy (e.g., notices of revision were sometimes prefixed with NORxxxx).
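A small normalization sketch for the shorthand recommendation above (the substitution map is illustrative, not a company standard):

    import re

    SHORTHAND = {"appr": "approved", "appv": "approved"}  # extend as needed

    def normalize_title(title):
        """Lowercase, split on non-alphanumerics, and expand known shorthand."""
        words = re.findall(r"[a-z0-9]+", title.lower())
        return " ".join(SHORTHAND.get(w, w) for w in words)

    # normalize_title("Design Review APPR 12-345") -> "design review approved 12 345"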
Takeaways
•  Identify the potential set of records that are likely in the repository. Adjust the set of records so that you can achieve the desired targets.
–  Use retention policy as a guide.
•  Establish goals for what is good enough: the probability to identify records and the probability for non-records.
•  Select the training set. The training set should be representative of what is expected in the repository and consist of enough items to establish the targets for all items.
•  Ensure that the documents are searchable, especially if in *.PDF format (a quick check is sketched below).
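One way to act on the searchable-PDF point: check whether a text layer can actually be extracted (pypdf shown as one option; an image-only scan typically yields little or no text):

    from pypdf import PdfReader

    def is_searchable(path, min_chars=20):
        """True if the PDF yields at least min_chars of extractable text."""
        reader = PdfReader(path)
        text = "".join((page.extract_text() or "") for page in reader.pages)
        return len(text.strip()) >= min_chars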
Extensions
•  The auto-classification technology can be used to extract metadata from documents:
–  especially forms that have designated information partitioning;
–  documents that follow standard headers, paragraphs, or certain section titles and topics.
•  We also did some work showing the tool can be used to screen documents for personally identifiable information (PII); a minimal pattern sketch follows.
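A minimal PII-screening sketch in the spirit of that extension; the patterns below are illustrative, not the tool's actual rule set:

    import re

    PII_PATTERNS = {
        "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # e.g., 123-45-6789
        "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),   # e.g., 333-555-4444
    }

    def screen_for_pii(text):
        """Return {pattern name: matches} for every pattern found in the text."""
        return {name: rx.findall(text)
                for name, rx in PII_PATTERNS.items() if rx.search(text)}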
Extensions
•  Example of a form where the shaded fields (yellow on the original slide) represent metadata extracted from the form; an extraction sketch follows the table:

Field                                       Extracted value
Author Name                                 John Doe
Employee Number                             123456
POC                                         John Doe
Employee Number                             123456
POC Mail Stop / Location                    123456
POC Telephone                               333-555-4444
Date                                        12/12/2012
Business Unit                               Corp
Author Functional Organization or Region    Technology
POC Cost Center                             66655

(The form's two sections are "Author / Publication Information" and "Point of Contact".)
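A sketch of the extraction step, assuming the form's text renders as label/value pairs separated by a colon or tab (the field list comes from the form above; the tool itself is configured differently):

    import re

    FIELDS = ["Author Name", "POC", "Employee Number",
              "POC Mail Stop / Location", "POC Telephone", "Date",
              "Business Unit", "Author Functional Organization or Region",
              "POC Cost Center"]

    def extract_fields(form_text):
        """Pull each labeled value out of one form's extracted text."""
        record = {}
        for field in FIELDS:
            # Require a colon or tab after the label so "POC" does not
            # swallow "POC Telephone"; repeated labels take the first hit.
            m = re.search(re.escape(field) + r"\s*[:\t]\s*(.+)", form_text)
            if m:
                record[field] = m.group(1).strip()
        return record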
Extension Results
•  Record metadata was extracted from these forms into a spreadsheet, and the spreadsheet was used to bulk-upload the items into the record repository, complete with metadata (a small sketch of the spreadsheet step follows).
–  The alternative would have been a fully manual process.
•  We noted that in some cases a field we needed was left blank, which led us to make that field required on the form.
•  Another repository had 5,000+ documents, all of which had a form as part of the document set.
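A sketch of the spreadsheet step (CSV shown for simplicity; extract_fields() and FIELDS are the illustrative sketches from the previous slide):

    import csv

    def write_bulk_upload(records, path="bulk_upload.csv"):
        """One extracted form per row, ready for a bulk upload into the repository."""
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(records)

    # e.g., write_bulk_upload([extract_fields(t) for t in form_texts])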
Contact Info
Steve Clark
Raytheon Company
Company Record Manager
781-522-5151 (o)
339-227-7678 (c)
Steven_f_clark@raytheon.com
