DATA EXTRACTION FROM UNSTRUCTURED
FINANCIAL DOCUMENTS – THE DATA
FACTORY
BOSTON, OCTOBER 1ST, 2013
GEORGE ROTH – RECOGNOS INC, SAN FRANCISCO, CA, USA
DREW WARREN – RECOGNOS FINANCIAL, NEW YORK, USA
DIA TRAMBITAS – RECOGNOS, CLUJ, ROMANIA – SEMANTIC TECHNOLOGY
IOANA BARBANTAN – RECOGNOS, CLUJ, ROMANIA – MACHINE LEARNING
IULIAN MARGARINTESCU – RECOGNOS, CLUJ, ROMANIA – SYSTEM ARCHITECT
WHAT IS THE DATA FACTORY
 The “Data Factory” is a complex hybrid system
 Software systems combined with some manual exception processing
 Used to extract data from a collection of unformatted or semi-formatted documents
 The Data Factory is designed to be improved continuously through ongoing optimizations (the number and role of the human participants decreases over time)
 The products of the Data Factory are Data Feeds, e.g. Mutual Fund Data, Vendor Contract Data, Security Master Data, Issuer Data, etc.
 Users receive the data feeds by subscription, using their chosen delivery method
THE COMPONENTS OF THE DATA FACTORY
 Data Sources that provide access to the documents used in the extraction process (“raw materials”)
 Content Management / Collaboration Environment – initiates the workflows associated with the extraction process and provides work queues where the designated human workforce performs tasks related to exception handling, the Q&A process, manual overrides, approvals, etc.
 Databases where the extracted data and the source documents are stored (we are using RavenDB – a NoSQL database – hosted on Amazon)
 Extraction software – a combination of multiple technologies used in the extraction process: NLP software, machine learning software, regular expression processing
 API – used to deliver the extracted data – can be REST based or batch delivery (XLS, XML, CSV, etc.)
 UI – used to browse and query the document collection together with the extracted data
THE TEAM: DATA FACTORY WORKERS
 Data Feed Manager – each product of the Data Factory (i.e. a Data Feed) has a Data Feed Manager
 Data Analyst – Subject Expert – a business analyst with expertise in the documents and associated business processes (e.g. Mutual Fund Data Analyst, SEC Documents Expert, etc.)
 Data Analysts – business people trained to write NLP extraction scripts who also understand the data. They contribute to the setup phase – writing the NLP extraction scripts – and to the maintenance phase – fulfilling computer-assisted tasks related to exception processing, issue resolution, etc.
 Software Engineers – set up the production line for a Data Feed and are responsible for assembling/customizing the software components used by the Data Factory. Their role decreases over time once the Data Factory is in production.
HOW TO SET UP A NEW “PRODUCTION LINE”
HOW DOES THE PRODUCTION LINE WORK
CASE STUDY
 Industry utility for the US Mutual Fund industry – implemented at DTCC New York – went live in April 2013
 20,000 Mutual Funds – 2 million data points
 Data changes every year – a one-year backlog
 Validated the existing “Profile Data” – created through crowdsourcing of the funds
 Found errors in filings – leading to refilings
 Duration – 6 months
 10 people involved in the factory
SAMPLE TAXONOMY (FRAGMENT)
str_exchange_buy_schedule_code yes Send
str_forfeitures_distribution_waiver_code yes Send
str_hardship_distribution_waiver_code yes Send
str_involuntary_distribution_waiver_code yes Send
str_loan_distribution_waiver_code yes Send
str_minimum_amt yes Send
str_payout_calculation_amt_code no NID Exclude
str_plan_fee_waiver_code yes Send
str_qualified_distribution_waiver_age yes Send
str_qualified_distribution_waiver_code yes Send
str_rate_{0} yes Send
GOALS:
• Structure SEC filings to allow easier, customized navigation within documents;
• Interactive table of contents;
• Links between funds, share classes and relevant chapters;
• Extract financial data points from Prospectus/SAI/(Sticker?) documents;
• Better and faster exploitation of information;
• Comparison of extracted data points with data from Profile;
• Structured data export to different formats → support for user-defined analysis on top of the extracted data;
• Maintain provenance links between the extracted data (MFIP) and the original US SEC filing
THE ANALYSIS PROCESS
1. Daily import – new filings are imported from the SEC platform;
2. Each filing (a package containing several documents) is split into documents;
3. XBRL, Prospectus (485BPOS), SAI (497) and Sticker (497) documents are further analyzed:
a) XBRL is automatically parsed and the extracted data can be visualized using a dedicated user interface (www.mfip.info)
b) A query system using JSONiq is being worked on
c) Prospectus, SAI and Sticker documents go through an iterative semi-automatic processing chain that performs a two-level structuring:
 Division into chapters for easy navigation;
 Data extraction for easy data exploitation.
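The routing logic of the steps above can be sketched as follows (a minimal, hypothetical Python sketch; function names and document-type strings are illustrative, not the production system):

```python
# Minimal sketch of the daily analysis flow. All names and document-type
# strings are illustrative, not the actual Recognos implementation.

def split_filing(filing):
    """A filing is a package containing several documents; split it."""
    return filing["documents"]

def route(document):
    """Decide which processing chain a document goes through."""
    if document["type"] == "XBRL":
        return "automatic parse"          # parsed fully automatically
    if document["type"] in ("Prospectus", "SAI", "Sticker"):
        return "semi-automatic chain"     # chapter division + data extraction
    return "not analyzed"

filing = {"documents": [{"type": "XBRL"},
                        {"type": "Prospectus"},
                        {"type": "N-Q"}]}
routes = [route(d) for d in split_filing(filing)]
```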
DOCUMENT STRUCTURING – PHASE 1
 Parsing
 The filing is divided into Document Parts, Sections and Chapters;
 The funds described in each Document Part are identified together with their lists of share classes.
 Linking
 Links between sections and share classes are created
 Validation
 The two previous steps are checked by a human expert and corrected if necessary.
PHASE 1 – FILING PARSING PROCESS
The original SEC filing – which has a filing class such as 485APOS, 485BPOS, 497, 497K, etc. – is parsed into a structure of Document Parts; each Document Part contains Sections with their Chapter lists, and Funds with their Share class lists.
The number and types of Document Parts contained within a filing depend on the filing class.
There are currently 10 document types available: N-CSR, N-CSRS, SAI Sticker, SAI, Summary Sticker, Summary Prospectus, Statutory Prospectus, Statutory Sticker, N-Q, 24F-2NT.
A Section corresponds, for example, to one Statutory Prospectus, one Summary Prospectus or one SAI identified within the current document part; a Document Part groups, for example, all the Statutory Prospectuses, all the Summary Prospectuses or all the SAIs identified within the current filing.
[Screenshot: the structured view shows the filing type, the list of funds for the current filing and the list of Share Classes for the current filing.]
[Screenshot: the navigation tree shows Document Parts, their Sections and their Chapters.]
The table of contents is automatically detected using available templates. Navigable links are automatically created between the chapter name and the chapter start index in the document.
DOCUMENT STRUCTURING – PHASE 1
 Parsing
 The filing is divided into Document Parts, Sections and Chapters;
 The funds described in each Document Part are identified together with their lists of share classes.
 Linking
 Links between sections and share classes are created
 Validation
 The two previous steps are checked by a human expert and corrected if necessary.
PHASE 1 – THE LINKING PROCESS
The original SEC filing – which has a filing class such as 485APOS, 485BPOS, 497, 497K, etc. – starts as a flat structure: each Document Part holds its Sections with their Chapter lists and, separately, its Funds with their Share class lists.
From a flatter structure towards…
PHASE 1 – THE LINKING PROCESS
…a hierarchical structure: each Document Part (e.g. all the Statutory Prospectuses, all the SAIs or all the Summary Prospectuses identified within the current filing) contains Sections (e.g. one Statutory Prospectus, one Summary Prospectus or one SAI identified within the current document part); each Section now directly contains its Funds with their Share class lists, plus its Chapter list.
Hierarchical links are created between Sections and Funds; each Section has an associated list of share classes.
DOCUMENT STRUCTURING – PHASE 1
 Parsing
 The filing is divided into Document Parts, Sections and Chapters;
 The funds described in each Document Part are identified together with their lists of share classes.
 Linking
 Links between sections and share classes are created
 Validation
 The two previous steps are checked by a human expert and corrected if necessary.
PHASE 2 – DATA EXTRACTION
2. Data extraction from Prospectus, SAI and Sticker documents:
 Analysis of complex documents (200–300 pages of unstructured English text) and extraction of relevant features of the mutual funds
 The extraction concentrates on a taxonomy of 260 data points representing features or characteristics of mutual funds
 A hybrid data extraction approach:
 Rule-based linguistic analysis
 Custom data propagation algorithm
 Check the data point's context in the historic document and propagate the value if the contexts match
 Custom table parsers based on
 Regular expressions
 Machine learning
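As an illustration of the regular-expression side of the hybrid approach, here is a hypothetical Python sketch of a single-data-point extractor that also records provenance (start index and length), as the production data points do. The pattern, field and sample text are invented; the real rules are far richer and the linguistic rules are written in Cogito's 4GL:

```python
import re

# Illustrative only: a regex extractor for one invented data point
# (minimum initial investment amount), recording provenance.
PATTERN = re.compile(
    r"minimum\s+initial\s+investment[^$]*\$([\d,]+)", re.IGNORECASE)

def extract_minimum_amount(text):
    m = PATTERN.search(text)
    if not m:
        return None
    # Keep the value together with its provenance in the source text,
    # mirroring the (Start Index, Length) pairs the system stores.
    return {"value": int(m.group(1).replace(",", "")),
            "start": m.start(), "length": m.end() - m.start()}

result = extract_minimum_amount(
    "The minimum initial investment for an IRA account is $1,000 per fund.")
```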
CLASS Y ; IVLCX / Class C ; IENAX / Class A ;
CLASS C ; IBUTX / Class B ; CLASS B ; CLASS A ;
CLASS C ; CLASS Y ; IAUYX / Class Y ; MGAVX /
CLASS B ; ITHCX / Class C ; CLASS C ; ITYAX /
Class A ; FLISX / Investor Class ; CLASS B ;
CLASS R5 ; IEFCX / Class C ; CLASS A ; MSVCX /
CLASS C ; CLASS C ; ILSBX / Class B ; ILSYX /
Class Y ; CLASS A ; MSAJX / CLASS R5 ; CLASS
B ; IENIX / CLASS R5 ; CLASS B ; MSAVX /
CLASS A ; MSARX / CLASS R ; FSTUX / Investor
Class ; ITYYX / Class Y ; CLASS A ; FGLDX /
Investor Class ; IGDAX / Class A ; CLASS Y ;
IENYX / Class Y ; ITYBX / Class B ; IENBX / Class B
; IUTCX / Class C ; Class R ; CLASS R5 ; CLASS Y
; IGDYX / Class Y ; FTPIX / CLASS R5 ; IGDCX /
Class C ; FSTEX / Investor Class ; ILSRX / Class R ;
MSAIX / CLASS Y ; CLASS B ; CLASS R5 ; IGDBX
/ Class B ; IAUTX / Class A ; CLASS R ; ILSAX /
Class A ; FSIUX / CLASS R5 ; CLASS R ; CLASS Y
; CLASS C ; CLASS A ; FTCHX / Investor Class ;
(61 Share Class IDs)
HTML Filing
Prospectus 1
Prospectus 2
Prospectus 3
Prospectus 15
Acc nb: 0000950123-12-011157
SAI 1
SAI 2
…
IENAX / Class A ; IENBX / Class B; FSTEX / Investor Class; IENYX /
Class Y; IEFCX / Class C; CLASS C; CLASS A; CLASS B;CLASS Y;
IGDBX / Class B; IGDCX / Class C; IGDYX / Class Y; IGDAX /
Class A; FGLDX / Investor Class;
ILSAX / Class A; ILSRX / Class R; ILSYX / Class Y; ILSBX / Class
B; FLISX / Investor Class; IVLCX / Class C;
CLASS R5
ITHCX / Class C; ITYAX / Class A; ITYYX / Class Y; ITYBX / Class B; FTPIX /
CLASS R5; FTCHX / Investor Class; IBUTX / Class B; IAUYX / Class Y; FSTUX /
Investor Class; IUTCX / Class C; IAUTX / Class A; FSIUX / CLASS R5; IVLCX /
Class C; FLISX / Investor Class; ILSBX / Class B; ILSYX / Class Y; ILSRX / Class
R; ILSAX / Class A; FGLDX / Investor Class; IGDAX / Class A; IGDYX / Class Y;
IGDCX / Class C; IGDBX / Class B; IENAX / Class A; IEFCX / Class C; IENIX /
CLASS R5; IENYX / Class Y; IENBX / Class B; FSTEX / Investor Class;
CLASS C; CLASS A; CLASS B; CLASS Y; MGAVX / CLASS B; MSVCX /
CLASS C; MSAJX / CLASS R5; MSAVX / CLASS A; MSARX / CLASS R;
MSAIX / CLASS Y; CLASS Y; CLASS A; CLASS C; CLASS B; CLASS R5;
CLASS R; CLASS Y; CLASS C; CLASS R5; CLASS B; CLASS R; CLASS A;
CLASS A; CLASS B; CLASS Y; CLASS C; CLASS B; CLASS C; CLASS A;
Class R; CLASS Y; CLASS R5;
Step 1 – Filings are parsed (MFIP Parsing + Linking) → Prospectus and SAI documents are linked with the Share Classes they describe
Step 2 - Prospectus and SAIs are grouped based on the common
share classes
Prospectus 1
IENAX / Class A ; IENBX / Class B; FSTEX / Investor Class; IENYX /
Class Y; IEFCX / Class C; CLASS C; CLASS A; CLASS B;CLASS Y;
SAI 1
ITHCX / Class C; ITYAX / Class A; ITYYX / Class Y; ITYBX / Class B; FTPIX /
CLASS R5; FTCHX / Investor Class; IBUTX / Class B; IAUYX / Class Y; FSTUX /
Investor Class; IUTCX / Class C; IAUTX / Class A; FSIUX / CLASS R5; IVLCX /
Class C; FLISX / Investor Class; ILSBX / Class B; ILSYX / Class Y; ILSRX / Class
R; ILSAX / Class A; FGLDX / Investor Class; IGDAX / Class A; IGDYX / Class Y;
IGDCX / Class C; IGDBX / Class B; IENAX / Class A; IEFCX / Class C; IENIX /
CLASS R5; IENYX / Class Y; IENBX / Class B; FSTEX / Investor Class;
Prospectus 1
IENAX / Class A ; IENBX / Class B; FSTEX / Investor Class; IENYX /
Class Y; IEFCX / Class C; CLASS C; CLASS A; CLASS B;CLASS Y;
SAI 2
CLASS C; CLASS A; CLASS B; CLASS Y; MGAVX / CLASS B; MSVCX /
CLASS C; MSAJX / CLASS R5; MSAVX / CLASS A; MSARX / CLASS R;
MSAIX / CLASS Y; CLASS Y; CLASS A; CLASS C; CLASS B; CLASS R5;
CLASS R; CLASS Y; CLASS C; CLASS R5; CLASS B; CLASS R; CLASS A;
CLASS A; CLASS B; CLASS Y; CLASS C; CLASS B; CLASS C; CLASS A;
Class R; CLASS Y; CLASS R5;
Document Pair 1
Document Pair 2
…
Document Pair n
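The grouping in Step 2 can be sketched as pairing each Prospectus with the SAIs that share at least one share class with it (illustrative Python; the ticker sets below are a small subset of the example above):

```python
# Sketch of Step 2: pair Prospectus and SAI documents that describe
# common share classes. Data is a truncated illustration.

prospectuses = {
    "Prospectus 1": {"IENAX", "IENBX", "FSTEX", "IENYX", "IEFCX"},
}
sais = {
    "SAI 1": {"ITHCX", "IENAX", "IEFCX", "IENYX", "IENBX", "FSTEX"},
    "SAI 2": {"MGAVX", "MSVCX", "MSAJX"},
}

def pair_documents(prospectuses, sais):
    pairs = []
    for p_name, p_classes in prospectuses.items():
        for s_name, s_classes in sais.items():
            if p_classes & s_classes:      # at least one common share class
                pairs.append((p_name, s_name))
    return pairs

pairs = pair_documents(prospectuses, sais)
```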
Step 3 – Paired Prospectus and SAI documents are automatically analyzed by the NLP/Table Parser Extraction Engine and a set of data points is extracted. At this stage there is no linking between the extracted data points and the Share Class IDs!
Example for Document Pair 1 (Prospectus 1 + SAI 1; share classes IENAX / Class A; IENBX / Class B; FSTEX / Investor Class; IENYX / Class Y; IEFCX / Class C):
• ROA Calculation Methodology Code: 01 (Prospectus, Start Index 337811, Length 277; Prospectus, Start Index 338089, Length 193)
• Link by Spouse Code: Yes (SAI, Start Index 19201, Length 2092); No (Prospectus, Start Index 25789, Length 1203); …
Step 4 – Extracted data is validated; missing values are searched for and added if available. All values are linked to Share Classes.
System overview
DATA EXTRACTION APPROACH
 Linguistic analysis
 Cogito Studio – Expert System, Italy
 Extraction rules written in a 4GL
 Custom table parsers
 Based on syntactic rules
 Based on machine learning
 Data propagation algorithm
 Historic data – automatically or manually extracted – can be used to guide/improve the extraction process
 If a value together with its context is the same as in the old document, it is automatically propagated to the new document, automatically linked and validated;
 If the value changed but the context is the same, the new value is extracted automatically, linked and validated.
EXAMPLE OF TEXT FROM WHICH A FEATURE IS
EXTRACTED – UNDERLINED IN YELLOW
THE EXTRACTION IN THE PREVIOUS EXAMPLES WAS DONE USING THE FOLLOWING RULE
The IDs are used to identify the semantics associated with a word (the syncon of a word).
Several primitives are available that can refer to sets of word senses (syncons) with very close meanings.
THE EXTRACTION RESULTS ARE VISUALIZED IN A CMS BY FINANCIAL EXPERTS WHO CAN VALIDATE OR MODIFY THE VALUE
VALIDATED EXTRACTION RESULTS ARE
DELIVERED TO OUR CUSTOMERS
[Screenshot: the field taxonomy, the extracted value, and the text from which the value was extracted.]
DATA POINTS CLASSIFICATION AND EXTRACTION
FROM TABLES. A MACHINE LEARNING APPROACH
EXTRACT DATA POINTS FROM TABLES
USING MACHINE LEARNING
 Dataset creation and model building
 Table identification
 Dataset construction from tables
 Create models based on the dataset
 Assign a Yes/No label to each of the instances
 Data classification
 Identify data point values based on the learned models
 Performance evaluation
EXTRACT DATA POINTS FROM TABLES
USING MACHINE LEARNING
System design
TABLE IDENTIFICATION
 Given a Prospectus as input, identify all tables
 Classify each table into one of the categories of interest: IRA, Fees and Expenses, Distribution Reinvestment
[Diagram: tables are routed by table type into Fees and Expenses Tables, Distribution Reinvestment Tables and IRA Tables.]
EXTRACT CONTEXT FROM TABLES
 Context Extraction
 Identify the horizontal context
 Characters and words to the left of the position of the value
 Identify the vertical context
 Define a window surrounding the value
 Select characters and words in the window above the value
 The context in the example: {account, Individual, Initial, investment, retirement}
 After applying preprocessing and stemming: {account, individu, init, investm, retir}
[Screenshot: the vertical and horizontal context around a value in a table.]
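A minimal sketch of the horizontal/vertical context extraction, assuming the table has already been parsed into rows of cells (the cell contents mirror the slide's example; the windowing and word filtering are simplified):

```python
# Sketch: extract the horizontal context (cells to the left of the value)
# and vertical context (cells in a window above it) from a parsed table.

table = [
    ["", "Initial investment", "Subsequent investment"],
    ["Individual retirement account", "$250", "$50"],
]

def extract_context(table, row, col, window=1):
    horizontal = [c for c in table[row][:col] if c]
    vertical = [table[r][col]
                for r in range(max(0, row - window), row) if table[r][col]]
    # Keep words only (drop currency symbols and pure numbers).
    words = " ".join(horizontal + vertical).replace("$", "").split()
    return sorted(set(w for w in words if not w.isdigit()))

context = extract_context(table, row=1, col=1)  # context of the "$250" cell
```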
TEXT PREPROCESSING
 Stop-word removal
 Words that carry no information relevant to our problem, including conjunctions and prepositions
 Punctuation and numeric value elimination
 Stemming
 Algorithm used: Lovins stemmer
 Algorithm description: 294 endings, 29 conditions, 35 transformation rules
 Steps:
 Step 1: find the longest ending, check its associated condition and remove it
 Step 2: apply the transformation rules whether or not an ending was removed
 Why use stemming:
 Normalization step: different forms of a word are reduced to the same stem
 Improves classification accuracy
 Reduces the number of attributes in the dataset
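A toy longest-ending stemmer in the spirit of Lovins' Step 1 (the ending list below is hand-picked just to reproduce the slide's examples; the real algorithm uses 294 endings with 29 conditions and 35 transformation rules):

```python
# Toy Lovins-style stemmer: remove the longest matching ending, subject to
# a minimum remaining stem length. Endings are illustrative, not Lovins'.

ENDINGS = ["ement", "ent", "ial", "al", "um", "s"]

def stem(word, min_stem=3):
    word = word.lower()
    # Step 1: find and remove the longest ending that leaves a long-enough stem.
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return word[:-len(ending)]
    return word

stems = [stem(w) for w in
         ["account", "individual", "initial", "investment",
          "retirement", "minimum"]]
```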
STEMMING EXAMPLE
 Problem: identify the field Minimum_IRA_Initial_Purchase_Amount in the table:
 The definition of the field contains the words {account, individual, initial, investment, retirement}
 Field in table = {accounts, individual, initial, minimum, no, retirement}
 The distance between the definition and the field is: 2.23
 After applying the stemmer:
 Definition of the field = {account, individu, init, investm, retir}
 Field in table = {account, individu, init, minim, retir, no}
 The distance between the definition and the field becomes: 1.71 → they are more similar (a lower distance means higher similarity)
CONTEXT AS INSTANCE
 0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, Minimum_IRA_Initial_Purchase_Amount
 Context {account, individu, init, investm, retir}
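Building such a binary instance can be sketched as follows (the vocabulary here is a small illustrative subset; in the real dataset every unique word across all examples becomes an attribute):

```python
# Sketch: turn a stemmed context into a binary bag-of-words instance with
# the data point category appended as the class label.

vocabulary = ["account", "coverdell", "individu", "init", "investm",
              "minim", "no", "retir", "subsequent"]

def to_instance(context, label):
    features = [1 if word in context else 0 for word in vocabulary]
    return features + [label]

instance = to_instance({"account", "individu", "init", "investm", "retir"},
                       "Minimum_IRA_Initial_Purchase_Amount")
```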
EUCLIDEAN DISTANCE BETWEEN INSTANCES
 d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
[Plot: the stemmed context words (account, individu, init, investm, retir; coverdell, subsequent; minim, no) positioned in the instance space.]
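On binary instances the Euclidean distance reduces to the square root of the number of differing attributes. A sketch on the stemming example's two contexts (the vocabulary order is illustrative; over this truncated vocabulary the distance is √3 ≈ 1.73, slightly different from the slide's 1.71, which was computed over the full attribute set):

```python
import math

# Euclidean distance between two binary context instances; a smaller
# distance means the contexts are more similar.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Illustrative vocabulary order: account, individu, init, investm, retir, minim, no
definition = [1, 1, 1, 1, 1, 0, 0]   # {account, individu, init, investm, retir}
field      = [1, 1, 1, 0, 1, 1, 1]   # {account, individu, init, minim, retir, no}
distance = euclidean(definition, field)
```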
DATASET CONSTRUCTION AND MODEL BUILDING
 Create a dataset from the processed documents for each data point
 Examples are represented as a bag of words: each unique word in the examples becomes an attribute in the dataset
 Dataset content
 Context {account, individu, init, investm, retir}
 Class → data point category {Minimum_IRA_Initial_Purchase_Amount}
 Build model
 Apply Random Forest
 algorithm for classification of instances
 consists of constructing a multitude of Decision Trees
 New instance classification:
 Run the instance down all trees → a classification accuracy from each tree
 The classification accuracy result is computed as the average of the previously obtained accuracies
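The run-down-all-trees step can be sketched with stand-in "trees" (a toy illustration only: the real system trains actual decision trees on the constructed dataset, and aggregates per-tree outputs rather than hand-written rules):

```python
from collections import Counter

# Toy Random-Forest-style voting. Each "tree" is a stand-in classifier
# function; a trained forest would build real decision trees from the data.

def tree_a(instance):
    return "Minimum_IRA_Initial_Purchase_Amount" if instance[0] else "Other"

def tree_b(instance):
    return "Minimum_IRA_Initial_Purchase_Amount" if instance[2] else "Other"

def tree_c(instance):
    return "Other"

FOREST = [tree_a, tree_b, tree_c]

def classify(instance):
    # Run the instance down every tree and aggregate the per-tree outputs.
    votes = Counter(tree(instance) for tree in FOREST)
    return votes.most_common(1)[0][0]

label = classify([1, 0, 1, 1, 1])
```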
DATA POINTS FROM TABLES USING MACHINE LEARNING – STEPS
 Dataset creation and model building
 Table identification
 Dataset construction from tables
 Create models based on the dataset
 Assign a Yes/No label to each of the instances
 Data classification
 Identify data point values based on the learned models
 Performance evaluation
HOW THE APPROACH IS USED FOR A NEW PROSPECTUS
 Identify tables and classify them by table type
 Load the models according to the types
 For each possible value identified in the table, extract its context (including stemming)
 Create a new instance with the context and an unknown class {context, ?}
 Run the instance through all models → an accuracy value is obtained for each model
 The label of the instance is taken from the model for which the greatest membership was obtained
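The final labelling step can be sketched as scoring the new instance's context against every per-data-point model and keeping the best-scoring label (the models and scores below are invented stand-ins, not the trained Random Forest models):

```python
# Sketch: run a new value's context through all per-data-point models and
# label it with the model giving the greatest membership score.

def model_ira(context):          # stand-in for a trained IRA model
    return len(context & {"account", "individu", "retir"}) / 3

def model_fees(context):         # stand-in for a trained fees model
    return len(context & {"fee", "expense", "annual"}) / 3

MODELS = {"Minimum_IRA_Initial_Purchase_Amount": model_ira,
          "Annual_Fund_Operating_Expenses": model_fees}

def label_value(context):
    scores = {name: model(context) for name, model in MODELS.items()}
    return max(scores, key=scores.get)   # label of the best-scoring model

label = label_value({"account", "individu", "init", "investm", "retir"})
```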
HOW THE APPROACH IS USED FOR A NEW
PROSPECTUS - EXAMPLE
[Screenshot: an IRA table with the identified and extracted value and its vertical and horizontal context.]
PERFORMANCE EVALUATION
 The performance of the system is evaluated by computing the classification accuracy on unseen data
 Classification accuracy refers to the ability of the model to correctly predict the class label of new, unseen data
 Reported as a percentage
 Model accuracy refers to how well the model fits the data
 We computed the accuracy for each of the models we built
MODEL ACCURACY FOR IRA DATA POINTS
 The classification accuracy varies between 41% and 85% due to:
 the numerous ways in which each data point can be expressed in the tables
 the number of examples used for training the model
 the correctness of the context extraction
DEMO – WWW.MFIP.INFO
CONTACT
 Business Information:
Drew Warren – Recognos Financial, New York
dwarren@recognosfinancial.com
 Technical Information:
George Roth – Recognos Inc. – San Francisco
groth@recognos.com
The system can be seen at: www.mfip.info
The presentation will be published on SlideShare by George Roth
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

The Semantic Data Factory – Boston Text Analytics World 2013

• 1. DATA EXTRACTION FROM UNSTRUCTURED FINANCIAL DOCUMENTS – THE DATA FACTORY
BOSTON, OCTOBER 1ST, 2013
GEORGE ROTH – RECOGNOS INC., SAN FRANCISCO, CA, USA
DREW WARREN – RECOGNOS FINANCIAL, NEW YORK, USA
DIA TRAMBITAS – RECOGNOS, CLUJ, ROMANIA – SEMANTIC TECHNOLOGY
IOANA BARBANTAN – RECOGNOS, CLUJ, ROMANIA – MACHINE LEARNING
IULIAN MARGARINTESCU – RECOGNOS, CLUJ, ROMANIA – SYSTEM ARCHITECT
• 2. WHAT IS THE DATA FACTORY
 The “Data Factory” is a complex hybrid system: software systems combined with some manual exception processing
 It is used to extract data from a collection of unformatted or semi-formatted documents
 The Data Factory is designed to improve continuously through optimization; the number and role of its human participants decreases over time
 The products of the Data Factory are Data Feeds, e.g. Mutual Fund Data, Vendor Contract Data, Security Master Data, Issuer Data, etc.
 Users of the data feeds receive the data based on subscriptions, using their chosen delivery method
• 3. THE COMPONENTS OF THE DATA FACTORY
 Data Sources that provide access to the documents used in the extraction process (“raw materials”)
 Content Management Collaboration Environment – initiates the workflows associated with the extraction process and provides work queues where the designated human workforce performs tasks such as exception handling, the Q&A process, manual overrides, approvals, etc.
 Databases where the extracted data and the source documents are stored (we use RavenDB – a NoSQL database – hosted on Amazon)
 Extraction software – a combination of multiple technologies used in the extraction process: NLP software, machine learning software, regular expression processing
 API – used to deliver the extracted data – can be REST-based or batch delivery (XLS, XML, CSV, etc.)
 UI – used to browse and query the document collection together with the extracted data
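The batch delivery formats named above (CSV, XML) can be sketched with the standard library; the feed records and field names here are hypothetical, not the actual feed schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical extracted records for a data feed (field names are illustrative).
records = [
    {"fund_id": "IENAX", "share_class": "Class A", "minimum_amt": "1000"},
    {"fund_id": "IENBX", "share_class": "Class B", "minimum_amt": "2500"},
]

def to_csv(rows):
    """Serialize feed records as CSV, one of the batch delivery formats."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_xml(rows):
    """Serialize feed records as a flat XML feed."""
    root = ET.Element("feed")
    for row in rows:
        rec = ET.SubElement(root, "record")
        for key, value in row.items():
            ET.SubElement(rec, key).text = value
    return ET.tostring(root, encoding="unicode")

print(to_csv(records).splitlines()[0])  # prints the CSV header row
```

A REST-based delivery would expose the same records through an HTTP endpoint instead of a file; the serialization step stays the same.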
• 4. THE TEAM: DATA FACTORY WORKERS
 Data Feed Manager – each product of the Data Factory (i.e. a Data Feed) has a Data Feed Manager
 Data Analyst – Subject Expert – a business analyst with expertise in the documents and associated business processes (e.g. Mutual Fund Data Analyst, SEC Documents Expert, etc.)
 Data Analysts – business people trained to write NLP extraction scripts who also understand the data. They contribute to the setup phase – writing the NLP extraction scripts – and to the maintenance phase – fulfilling computer-assisted tasks related to exception processing, issue resolution, etc.
 Software Engineers – set up the production line for a Data Feed and are in charge of assembling/customizing the software components used by the Data Factory. Their role decreases over time once the Data Factory is in production.
• 5. HOW TO SET UP A NEW “PRODUCTION LINE” (diagram)
• 6. HOW DOES THE PRODUCTION LINE WORK (diagram)
• 7. CASE STUDY
 Industry Utility for the US Mutual Fund Industry – implemented at DTCC New York – went live in April 2013
 20,000 Mutual Funds – 2 Million Data Points
 Changing every year – backlog of one year
 Validated the existing “Profile Data” – created through crowdsourcing of the funds
 Found errors in filings – refiling
 Duration – 6 months
 10 people involved in the factory
• 8. SAMPLE TAXONOMY (FRAGMENT)
str_exchange_buy_schedule_code          yes       Send
str_forfeitures_distribution_waiver_code  yes     Send
str_hardship_distribution_waiver_code   yes       Send
str_involuntary_distribution_waiver_code  yes     Send
str_loan_distribution_waiver_code       yes       Send
str_minimum_amt                         yes       Send
str_payout_calculation_amt_code         no   NID  Exclude
str_plan_fee_waiver_code                yes       Send
str_qualified_distribution_waiver_age   yes       Send
str_qualified_distribution_waiver_code  yes       Send
str_rate_{0}                            yes       Send
• 9. GOALS:
• Structure SEC filings in order to allow easier, customized navigation within documents;
• Interactive table of contents;
• Links between funds, share classes and relevant chapters;
• Extract financial data points from Prospectus/SAI/(Sticker?) documents;
• Better and faster exploitation of information;
• Comparison of extracted data points with data from Profile;
• Structured data export to different formats  support for user-defined analysis on top of the extracted data;
• Maintain provenance link: Filing  MFIP
• 10. THE ANALYSIS PROCESS
1. Daily import – new filings are imported from the SEC platform;
2. Each filing (a package containing several documents) is split into documents;
3. XBRL, Prospectus (485BPOS), SAI (497) and Sticker (497) documents are further analyzed:
a) XBRL is automatically parsed and the extracted data can be visualized using a dedicated user interface (www.mfip.info)
b) A query system using JSONiq is in progress
c) Prospectus, SAI and Sticker documents go through an iterative semi-automatic processing chain that structures them on 2 levels:
 Division into chapters for easy navigation;
 Data extraction for easy data exploitation.
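The split-and-route step (2–3 above) can be sketched as follows. The form-type strings come from the slide; the document contents and the routing table itself are illustrative assumptions, not the production code.

```python
# A filing is a package of documents; each document carries an SEC form type.
# Only some types are analyzed further; the rest (e.g. graphics) are skipped.
ANALYZED_TYPES = {"485BPOS": "prospectus", "497": "sai_or_sticker", "XBRL": "xbrl"}

def split_filing(filing):
    """Split a filing into documents and route the types analyzed further."""
    routed = {}
    for doc in filing["documents"]:
        kind = ANALYZED_TYPES.get(doc["form_type"])
        if kind is not None:
            routed.setdefault(kind, []).append(doc["name"])
    return routed

filing = {
    "accession": "0000950123-12-011157",
    "documents": [
        {"name": "prospectus1.htm", "form_type": "485BPOS"},
        {"name": "sai1.htm", "form_type": "497"},
        {"name": "graphic.jpg", "form_type": "GRAPHIC"},
    ],
}
print(split_filing(filing))
```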
• 11. DOCUMENT STRUCTURING – PHASE 1
 Parsing
 The filing is divided into Document Parts, Sections and Chapters;
 Funds described in each Document Part are identified together with their lists of share classes.
 Linking
 Links between sections and share classes are created.
 Validation
 The two previous steps are checked by a human expert and corrected if necessary.
• 12. PHASE 1 – FILING PARSING PROCESS (diagram)
The original SEC Filing – which has a filing class such as 485APOS, 485BPOS, 497, 497K, etc. – is divided into Document Parts 1…k; each Document Part contains Sections with their Chapter lists, and Funds with their Share class lists.
The number and types of Document Parts contained within a filing depend on the filing class. There are currently 10 document types available: N-CSR, N-CSRS, SAI Sticker, SAI, Summary Sticker, Summary Prospectus, Statutory Prospectus, Statutory Sticker, N-Q, 24F-2NT.
E.g. one Statutory Prospectus, one Summary Prospectus or one SAI identified within the current document part; all the Summary Prospectuses, SAIs or Statutory Prospectuses identified within the current filing.
• 13. (screenshot) Filing type; List of funds for the current filing; List of Share Classes for the current filing
• 14. (screenshot) Document Part 1 and Document Part 2; Sections 1.1, 1.2, 1.4, 2.1, 2.2; Chapters
• 15. (screenshot) Table of contents automatically detected using available templates. Navigable links are automatically created between the chapter name and the chapter start index in the document.
• 16. DOCUMENT STRUCTURING – PHASE 1
 Parsing
 The filing is divided into Document Parts, Sections and Chapters;
 Funds described in each Document Part are identified together with their lists of share classes.
 Linking
 Links between sections and share classes are created.
 Validation
 The two previous steps are checked by a human expert and corrected if necessary.
• 17. PHASE 1 – THE LINKING PROCESS (diagram)
The original SEC Filing – with a filing class such as 485APOS, 485BPOS, 497, 497K, etc. – starts from the flat parsing structure (Document Parts with separate Section/Chapter lists and Fund/Share class lists): from a flatter structure towards…
• 18. PHASE 1 – THE LINKING PROCESS (diagram)
… a hierarchical structure: hierarchical links are created between Sections and Funds, so each Section now contains its Funds together with their Share class lists and its Chapter list (e.g. one Statutory Prospectus, one Summary Prospectus or one SAI per document part; all the Statutory Prospectuses, SAIs and Summary Prospectuses identified within the current filing).
• 19. (screenshot) Each section has an associated list of share classes.
• 20. DOCUMENT STRUCTURING – PHASE 1
 Parsing
 The filing is divided into Document Parts, Sections and Chapters;
 Funds described in each Document Part are identified together with their lists of share classes.
 Linking
 Links between sections and share classes are created.
 Validation
 The two previous steps are checked by a human expert and corrected if necessary.
• 21. PHASE 2 – DATA EXTRACTION
2. Data extraction from Prospectus, SAI and Sticker documents:
 Analysis of complex documents (200–300 pages of unstructured English text) and extraction of relevant features of the mutual funds
 The extraction concentrates on a taxonomy of 260 data points representing features or characteristics of mutual funds
 A hybrid data extraction approach:
 Rule-based linguistic analysis
 Custom data propagation algorithm: check a data point's context in the historic document and propagate the value if the contexts match
 Custom table parsers based on:
 Regular expressions
 Machine learning
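The regular-expression side of the hybrid approach can be sketched as below. The pattern, the sample sentence and the data point name are illustrative assumptions, not the production rules; note how the sketch keeps character offsets, since the Data Factory stores a start index and length as provenance for each extracted value.

```python
import re

# Illustrative regex parser for a single hypothetical data point.
MIN_INVESTMENT = re.compile(
    r"minimum\s+initial\s+investment\s+(?:is|of)\s+\$([\d,]+)", re.IGNORECASE
)

def extract_min_investment(text):
    match = MIN_INVESTMENT.search(text)
    if match is None:
        return None
    value = int(match.group(1).replace(",", ""))
    # Provenance: start index and length of the matched span.
    return {"value": value, "start": match.start(), "length": match.end() - match.start()}

text = "The minimum initial investment is $2,500 for Class A shares."
print(extract_min_investment(text))
```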
• 22. (diagram) Step 1 – Filings are parsed  Prospectus and SAI documents are linked with the Share Classes they describe.
The HTML Filing (Acc nb: 0000950123-12-011157) contains Prospectus 1 … Prospectus 15 and SAI 1, SAI 2, …, and lists 61 Share Class IDs in total (e.g. IENAX / Class A; IENBX / Class B; FSTEX / Investor Class; IGDAX / Class A; ILSAX / Class A; MGAVX / Class B; …). After MFIP Parsing + Linking, each Prospectus and SAI is associated with the subset of Share Class IDs it describes.
• 23. (diagram) Step 2 – Prospectuses and SAIs are grouped based on their common share classes, producing Document Pair 1, Document Pair 2, … Document Pair n. E.g. Prospectus 1 (IENAX / Class A; IENBX / Class B; FSTEX / Investor Class; IENYX / Class Y; IEFCX / Class C; …) pairs with SAI 1, which covers the same share classes among others, and separately with SAI 2.
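The grouping in Step 2 amounts to pairing documents that share at least one share-class ID. The ticker lists below are taken from the slide; the pairing logic is an illustrative reconstruction, not the production algorithm.

```python
# Pair each Prospectus with the SAIs that share at least one share-class ID.
def pair_documents(prospectuses, sais):
    pairs = []
    for p_name, p_classes in prospectuses.items():
        for s_name, s_classes in sais.items():
            common = set(p_classes) & set(s_classes)
            if common:
                pairs.append((p_name, s_name, sorted(common)))
    return pairs

prospectuses = {"Prospectus 1": ["IENAX", "IENBX", "FSTEX", "IENYX", "IEFCX"]}
sais = {
    "SAI 1": ["ITHCX", "IENAX", "IEFCX", "IENYX", "IENBX", "FSTEX"],
    "SAI 2": ["MGAVX", "MSVCX", "MSAJX"],
}
print(pair_documents(prospectuses, sais))
```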
• 24. (diagram) Step 3 – Paired Prospectus and SAI documents are automatically analyzed by the NLP/Table Parser Extraction Engine and a set of data points is extracted, each with provenance: e.g. ROA Calculation Methodology Code = 01 (Prospectus, Start Index 337811, Length 277; Prospectus, Start Index 338089, Length 193); Link by Spouse Code = Yes (SAI, Start Index 19201, Length 2092) / No (Prospectus, Start Index 25789, Length 1203). At this step there is still no linking between the Extracted Data Points and the Share Class IDs.
• 25. (screenshot) Step 4 – Extracted data is validated, missing values are searched for and added if available. All values are linked to Share Classes.
• 27. DATA EXTRACTION APPROACH
 Linguistic analysis
 Cogito Studio – Expert System, Italy
 Extraction rules written in a 4GL
 Custom table parsers
 Based on syntactic rules
 Based on machine learning
 Data propagation algorithm
 Historic data – automatically or manually extracted – can be used to guide/improve the extraction process
 If a value together with its context is the same as in the old document, it is automatically propagated to the new document, automatically linked and validated;
 If the value changed but the context is the same, the new value is extracted automatically, linked and validated.
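The data propagation rules can be sketched as a three-way decision. The record layout and the plain-string context are simplifying assumptions; the production system compares richer contexts.

```python
# Compare a data point's context in the historic document with the context
# found in the new document, and decide how to handle the value.
def propagate(historic, new_context, new_value):
    """historic: {"context": ..., "value": ...} from the previous document."""
    if historic["context"] == new_context:
        if historic["value"] == new_value:
            # Same context, same value: propagate and auto-validate.
            return {"value": historic["value"], "status": "propagated"}
        # Same context, new value: take the new value, auto-link and validate.
        return {"value": new_value, "status": "auto-extracted"}
    # Context changed: fall back to full extraction / manual review.
    return {"value": None, "status": "needs-review"}

historic = {"context": "minimum initial investment", "value": "2500"}
print(propagate(historic, "minimum initial investment", "2500"))
print(propagate(historic, "minimum initial investment", "3000"))
```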
  • 28. EXAMPLE OF TEXT FROM WHICH A FEATURE IS EXTRACTED – UNDERLINED IN YELLOW 28
• 29. THE EXTRACTION IN THE PREVIOUS EXAMPLES WAS DONE USING THE FOLLOWING RULE
The IDs are used to identify the semantics associated with a word (the syncon of a word). Several primitives are available that can refer to sets of word senses (syncons) with very close meanings.
• 30. THE EXTRACTION RESULTS ARE VISUALIZED IN A CMS BY FINANCIAL EXPERTS WHO CAN VALIDATE OR MODIFY THE VALUE
• 31. VALIDATED EXTRACTION RESULTS ARE DELIVERED TO OUR CUSTOMERS
(screenshot labels: Field taxonomy; Extracted value; The text from which the value was extracted)
  • 32. DATA POINTS CLASSIFICATION AND EXTRACTION FROM TABLES. A MACHINE LEARNING APPROACH 32
• 33. EXTRACT DATA POINTS FROM TABLES USING MACHINE LEARNING
 Dataset creation and Model building
 Table identification
 Dataset construction from tables
 Create models based on the dataset
 Assign a Yes/No label to each of the instances
 Data classification
 Identify data point values based on the learned models
 Performance evaluation
• 34. EXTRACT DATA POINTS FROM TABLES USING MACHINE LEARNING – SYSTEM DESIGN (diagram)
• 35. TABLE IDENTIFICATION
 Given a Prospectus as input, identify all tables
 Classify each table into one of the categories of interest: IRA, Fees and Expenses, Distribution Reinvestment
(screenshot examples by table type: Fees and Expenses Tables; Distribution Reinvestment Tables; IRA Tables)
• 36. EXTRACT CONTEXT FROM TABLES
 Context Extraction
 Identify horizontal context
 Characters and words to the left of the position of the value
 Identify vertical context
 Define a window surrounding the value
 Select characters and words in the window above the value
 The context in the example: {account, Individual, Initial, investment, retirement}
 After applying preprocessing and stemming: {account, individu, init, investm, retir}
(diagram labels: Vertical context; Horizontal context)
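The horizontal/vertical context extraction can be sketched over a table represented as a grid of cells. The grid layout and window size are illustrative assumptions; the real parser works on richer table structures.

```python
# Horizontal context = cell text to the left of the value in the same row;
# vertical context = cell text in a window of rows above the value.
def extract_context(table, row, col, window=2):
    horizontal = [cell for cell in table[row][:col] if cell]
    vertical = []
    for r in range(max(0, row - window), row):
        for cell in table[r]:
            if cell:
                vertical.append(cell)
    return sorted(set(w for cell in horizontal + vertical for w in cell.split()))

table = [
    ["", "Individual retirement account"],
    ["", "Initial investment"],
    ["Minimum", "$250"],
]
print(extract_context(table, row=2, col=1))
```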
• 37. TEXT PREPROCESSING
 Stop word removal
 Words that carry no information relevant to our problem, including conjunctions and prepositions
 Punctuation and numeric value elimination
 Stemming
 Algorithm used: Lovins Stemmer
 Algorithm description:
 294 endings, 29 conditions, 35 transformation rules
 Steps:
 Step 1: find the longest ending, associate it with a condition and remove it
 Step 2: apply the transformation whether or not the ending was removed
 Why use stemming:
 Normalization step: different forms of a word are mapped to the same stem
 Improves classification accuracy
 Reduces the number of attributes in the dataset
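A minimal sketch of the preprocessing chain: stop-word removal, punctuation/number elimination, and longest-suffix stemming. The real system uses the Lovins stemmer (294 endings, 35 transformation rules); the tiny suffix list below is an assumption chosen only to reproduce the slide's stems, not a faithful Lovins implementation.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "no", "for"}
# Toy suffix set, longest-first as in Lovins step 1.
SUFFIXES = sorted(["ement", "ial", "ent", "al", "um", "s"], key=len, reverse=True)

def stem(word):
    """Remove the longest matching ending, keeping a minimum stem length."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop punctuation/numbers and stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted({stem(t) for t in tokens if t not in STOP_WORDS})

print(preprocess("Minimum initial investment for an Individual Retirement Account: $250"))
```

On this input the sketch yields the slide's stems: individual → individu, initial → init, investment → investm, retirement → retir.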
• 38. STEMMING EXAMPLE
 Problem: identify the field Minimum_IRA_Initial_Purchase_Amount in the table
 The definition of the field contains the words {account, individual, initial, investment, retirement}
 The field in the table = {accounts, individual, initial, minimum, no, retirement}
 The similarity score between the definition and the field is 2.23 (lower is more similar)
 After applying the stemmer:
 Definition of the field = {account, individu, init, investm, retir}
 Field in the table = {account, individu, init, minim, retir, no}
 The similarity score between the definition and the field becomes 1.71  they are more similar
• 39. CONTEXT AS INSTANCE
 Context: {account, individu, init, investm, retir}
 The context becomes a binary instance over the attribute vocabulary (1 where an attribute such as account, individu, init, investm or retir occurs in the context, 0 elsewhere – attributes such as coverdell and subsequent stay 0), with the data point category as the class label:
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, Minimum_IRA_Initial_Purchase_Amount
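Turning a context into such a binary instance is a one-hot bag-of-words encoding. The vocabulary below is a small illustrative subset of the real attribute list.

```python
# Encode a context set as a 0/1 vector over a fixed attribute vocabulary.
def to_instance(context, vocabulary):
    return [1 if attribute in context else 0 for attribute in vocabulary]

vocabulary = ["account", "coverdell", "individu", "init", "investm", "retir", "subsequent"]
context = {"account", "individu", "init", "investm", "retir"}
print(to_instance(context, vocabulary))
```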
• 41. DATASET CONSTRUCTION AND MODEL BUILDING
 Create a dataset from the processed documents for each data point
 Examples are represented as a bag of words: each unique word in the examples becomes an attribute in the dataset
 Dataset content
 Context {account, individu, init, investm, retir}
 Class  data point category {Minimum_IRA_Initial_Purchase_Amount}
 Build the model
 Apply Random Forest
 algorithm for classification of instances
 consists of constructing a multitude of Decision Trees
 New instance classification:
 Run the instance down all trees  a classification score from each tree
 The classification result is computed as the average of the previously obtained scores
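The ensemble idea above – many trees, each voting, with the result averaged – can be sketched in miniature. These hand-written stumps stand in for trained decision trees; a real Random Forest would learn its splits from the dataset.

```python
# Each "tree" votes 0.0 or 1.0 on whether the instance matches the data point.
def tree_1(instance):  # splits on attribute "retir"
    return 1.0 if instance.get("retir") else 0.0

def tree_2(instance):  # splits on attribute "init"
    return 1.0 if instance.get("init") else 0.0

def tree_3(instance):  # splits on attribute "coverdell"
    return 0.0 if instance.get("coverdell") else 1.0

FOREST = [tree_1, tree_2, tree_3]

def forest_score(instance):
    """Average the per-tree outputs, as in new-instance classification."""
    return sum(tree(instance) for tree in FOREST) / len(FOREST)

instance = {"account": 1, "individu": 1, "init": 1, "investm": 1, "retir": 1}
print(forest_score(instance))
```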
• 42. (diagram)
• 43. DATA POINTS FROM TABLES USING MACHINE LEARNING – STEPS
 Dataset creation and Model building
 Table identification
 Dataset construction from tables
 Create models based on the dataset
 Assign a Yes/No label to each of the instances
 Data classification
 Identify data point values based on the learned models
 Performance evaluation
• 44. HOW THE APPROACH IS USED FOR A NEW PROSPECTUS
 Identify tables and classify them by table type
 Load the models according to the types
 For each possible value identified in the table, extract the context (including stemming)
 Create a new instance with the context and an unknown class: {context, ?}
 Run the instance through all models  a score is obtained for each model
 The label of the instance will be that of the model for which the greatest membership was obtained
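The final step – run the instance through all per-data-point models and keep the label with the greatest membership – is an argmax over model scores. The models here are hypothetical score functions standing in for the trained forests.

```python
# Pick the label of the model with the greatest membership score.
def classify(instance, models):
    best_label, best_score = None, float("-inf")
    for label, model in models.items():
        score = model(instance)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

models = {
    "Minimum_IRA_Initial_Purchase_Amount":
        lambda ctx: len(ctx & {"account", "init", "retir"}) / 3,
    "Minimum_IRA_Subsequent_Purchase_Amount":
        lambda ctx: len(ctx & {"account", "subsequen", "retir"}) / 3,
}
context = {"account", "individu", "init", "investm", "retir"}
print(classify(context, models))
```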
• 45. HOW THE APPROACH IS USED FOR A NEW PROSPECTUS – EXAMPLE
(screenshot labels: IRA table; Identified and extracted value; Vertical context; Horizontal context)
• 46. PERFORMANCE EVALUATION
 Evaluate the performance of the system by computing the classification accuracy on unseen data
 Classification accuracy refers to the ability of the model to correctly predict the class label of new, unseen data
 Reported as a percentage
 Model accuracy refers to the degree of fit between the data and the model
 We computed the accuracy for each of the models we built
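Classification accuracy as defined above is simply the share of correctly predicted labels, reported as a percentage; the label values are illustrative.

```python
# Fraction of instances whose predicted label matches the true label, in %.
def accuracy(predicted, actual):
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)

predicted = ["yes", "no", "yes", "yes", "no"]
actual    = ["yes", "no", "no",  "yes", "no"]
print(accuracy(predicted, actual))  # prints 80.0
```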
• 47. MODEL ACCURACY FOR IRA DATA POINTS
 The classification accuracy varies between 41% and 85% due to:
 the numerous ways in which a data point can be expressed in the tables
 the number of examples used for training the model
 the correctness of the context extraction
• 49. CONTACT
 Business Information: Drew Warren – Recognos Financial, New York – dwarren@recognosfinancial.com
 Technical Information: George Roth – Recognos Inc., San Francisco – groth@recognos.com
The system can be seen at www.mfip.info. The presentation will be published on SlideShare by George Roth.