Text Mining and Continuous Assurance
Kevin Moffitt
Continuous Assurance
• Allows for the automated and frequent review of business data
• Current focus is on structured data
– General ledgers
– Financial statements
– XBRL
• However, we cannot ignore the information found in unstructured
data
– Textual data, for example the narrative portion of financial disclosures
• Up to 85% of the data in financial disclosures is in the form of text
Text Mining
• Many methods for extracting data from text
• One popular method is to use dictionaries/word lists
• E.g. Dictionary to identify positive language in business
documents…
SATISFIES
PREEMINENT
REWARDED
BENEFITTING
SOLVING
COLLABORATIONS
BOOST
TREMENDOUS
GREATEST
PERFECTLY
DELIGHTING
COMPLIMENTING
EXCITING
REBOUNDED
CONCLUSIVE
ASSURE
INNOVATED
ENJOYING
CREATIVE
GREATLY
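A minimal sketch of the dictionary method, assuming the twenty words above are the whole word list and that simple letter-based tokenization is good enough. The function name `positive_word_rate` and the sample sentence are hypothetical, not from the slides.

```python
import re
from collections import Counter

# Word list copied from the slide; a real dictionary would be much longer.
POSITIVE_WORDS = {
    "SATISFIES", "PREEMINENT", "REWARDED", "BENEFITTING", "SOLVING",
    "COLLABORATIONS", "BOOST", "TREMENDOUS", "GREATEST", "PERFECTLY",
    "DELIGHTING", "COMPLIMENTING", "EXCITING", "REBOUNDED", "CONCLUSIVE",
    "ASSURE", "INNOVATED", "ENJOYING", "CREATIVE", "GREATLY",
}


def positive_word_rate(text):
    """Return (hits per word, share of tokens that are in the word list)."""
    tokens = re.findall(r"[A-Z']+", text.upper())
    hits = Counter(t for t in tokens if t in POSITIVE_WORDS)
    rate = sum(hits.values()) / len(tokens) if tokens else 0.0
    return hits, rate


# Hypothetical sentence for illustration only.
sample = ("We rebounded from a weak quarter and expect exciting, "
          "creative products to boost sales.")
hits, rate = positive_word_rate(sample)
print(dict(hits), round(rate, 3))
```

Each word is matched on its own, so a phrase such as "far from exciting" would still count as a positive hit.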
Drawbacks of Dictionary Method
• Single words
– Context Free
– Naïve
Lexical Bundles
• Frequent multi-word sequences in a given corpus (e.g. financial
reports, history journals, biology journals)
• More context in phrases than in individual words
• Criteria for identifying lexical bundles
– Sequences of four or more words
– Occur in at least 15% of unique documents
– Occur at a rate of at least 20 times per million words
Example Lexical Bundles from Annual Reports
the fair value of
be adversely affected by
as a percentage of
assets and liabilities and
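The identification criteria can be sketched roughly as follows. This is an illustration under stated assumptions, not the study's actual procedure: it handles only fixed-length four-word sequences, splits on whitespace, and the function name `find_lexical_bundles` and the three toy documents are hypothetical.

```python
from collections import Counter


def find_lexical_bundles(documents, n=4, min_doc_share=0.15,
                         min_per_million=20.0):
    """Return {bundle: (rate per million words, share of documents)}."""
    counts = Counter()      # total occurrences of each n-word sequence
    doc_freq = Counter()    # number of documents containing each sequence
    total_words = 0
    for doc in documents:
        tokens = doc.lower().split()
        total_words += len(tokens)
        grams = [" ".join(tokens[i:i + n])
                 for i in range(len(tokens) - n + 1)]
        counts.update(grams)
        doc_freq.update(set(grams))
    if not documents or total_words == 0:
        return {}
    bundles = {}
    for gram, count in counts.items():
        per_million = count * 1_000_000 / total_words
        doc_share = doc_freq[gram] / len(documents)
        if per_million >= min_per_million and doc_share >= min_doc_share:
            bundles[gram] = (per_million, doc_share)
    return bundles


# Toy corpus; the document-share threshold is raised so the filter
# has a visible effect on only three documents.
docs = [
    "the fair value of the asset",
    "we report the fair value of goodwill",
    "sales rose sharply this year overall",
]
bundles = find_lexical_bundles(docs, min_doc_share=0.5)
print(sorted(bundles))
```

On a corpus this small the per-million threshold is always met, so only the document-share filter matters here; on real annual reports both thresholds would do work.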
Lexical Bundles
• Research objective - Use Lexical Bundles to discriminate between
Fraudulent and Non-fraudulent Financial Reports
Research Questions
• RQ1: What are the most frequently used lexical bundles in fraudulent and
non-fraudulent Management Discussion and Analysis section (MD&A) of
annual reports?
• RQ2: Which lexical bundles are used at a considerably different rate in
fraudulent and non-fraudulent MD&As?
• RQ3: Can lexical bundles be used to classify fraudulent and non-fraudulent
MD&As at a rate greater than chance?
Sample Selection
• Identified 101 fraudulent annual reports (10-Ks) from a set of SEC
investigations
• Analyzed the Management Discussion and Analysis (MD&A) section
of each 10-K
– Gives investors a view of the company from management’s perspective
– Contains some of the least structured language in the 10-K
– Most-read part of the 10-K
Sample Selection
Sample selection criteria for fraudulent 10-Ks
Companies identified as fraudulent by searching through AAERs             141
Disqualified because fraud did not involve 10-Ks                          (20)
Disqualified because 10-K was not available from the EDGAR DB             (10)
Disqualified because 10-K did not contain management discussion section   (10)
Final count of qualifying fraudulent 10-Ks used in the sample             101
Sample Selection—Types of Fraud
Type of Fraud                                                  Companies
Overstatement of revenues                                      44
Combination of overstating revenue and understating expenses   25
Disclosure issue                                               10
Overstatement of inventory                                     6
Other income-increasing effects                                6
Understatement of provisions for loan-loss reserves            5
Other                                                          5
Sample Selection – Non-Fraudulent sample
• 101 Matching Non-Fraudulent 10-Ks were identified
Results
Lexical Bundle Identification
• 560 Lexical Bundles were identified
Creative Accounting
Lexical Bundle                         Fraud Bundles Per Million Words   NonFraud Bundles Per Million Words   % difference
in process research and development    199                               76                                   160%
goodwill and other intangible assets   121                               82                                   47%
Big Bath Charges
• Wholesale aggressive restructuring to improve
cost and expense structure for the future
– Disposition of long-lived assets
Lexical Bundle                         Fraud Bundles Per Million Words   NonFraud Bundles Per Million Words   % difference
disposition of long lived assets and   49                                21                                   139%
Fair Value Accounting
• Subjective method for assigning value to an asset
– Change value of assets
– Understate debt obligations
– Misrepresent foreign currency exchange adjustments
Lexical Bundle                 Fraud Bundles Per Million Words   NonFraud Bundles Per Million Words   % difference
the fair value of              257                               171                                  50%
in foreign currency exchange   41                                21                                   97%
Lexical Bundles Used More Frequently in Non-Fraudulent MD&As
• Conservative language for accounting practices
Lexical Bundle                   Fraud Bundles Per Million Words   NonFraud Bundles Per Million Words   % difference
to continue as a going concern   15                                91                                   513%
disclosures about market risk    85                                115                                  36%
material impact on the           38                                52                                   35%
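The rate and percent-difference columns in these tables can be reproduced approximately with the sketch below. It assumes the percent difference is taken relative to the lower of the two rates, which is inferred from the figures rather than stated on the slides; the function names and the raw count in the first example are hypothetical.

```python
def per_million(count, total_words):
    """Occurrences of a bundle normalized to a rate per million words."""
    return count * 1_000_000 / total_words


def percent_difference(higher_rate, lower_rate):
    """How much larger the higher rate is, as a percentage of the lower."""
    return (higher_rate - lower_rate) / lower_rate * 100


# Hypothetical raw count: 3 occurrences in a 150,000-word sample.
example_rate = per_million(3, 150_000)

# "the fair value of": 257 (fraud) vs 171 (non-fraud) per million words.
fair_value_diff = percent_difference(257, 171)
print(example_rate, round(fair_value_diff))
```

Recomputing the other rows from the rounded rates shown on the slides lands within a few points of the printed percentages (for example, 199 vs 76 gives about 162% against the 160% shown), presumably because the original figures were computed from unrounded rates.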
Principal Component Analysis
• Variable reduction procedure
– Combines correlated variables into principal components
• Principal components
– First component accounts for maximum amount of total variance in the
observed variables
– Components are uncorrelated
• Components are made up of correlated variables
– Overlapping lexical bundles are combined
Correlated bundles transformed into one principal component
4-word bundles → 6-word component
4-word bundles: there can be no; can be no assurance; be no assurance that
5-word overlaps: there can be no assurance; can be no assurance that
6-word component: there can be no assurance that
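The slides do not say how the overlapping bundles in a component were joined into one longer phrase for display. One possible way, sketched here as an assumption, is to greedily stitch sequences whose word-level suffix and prefix overlap. The names `overlap` and `merge_bundles` are hypothetical.

```python
def overlap(a, b, min_overlap=2):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def merge_bundles(bundles, min_overlap=2):
    """Repeatedly join the pair of word sequences with the largest overlap."""
    seqs = [b.split() for b in bundles]
    while True:
        best = (0, None, None)  # (overlap length, left index, right index)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break
        joined = seqs[i] + seqs[j][k:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (i, j)]
        seqs.append(joined)
    return [" ".join(s) for s in seqs]


merged = merge_bundles([
    "there can be no",
    "can be no assurance",
    "be no assurance that",
])
print(merged)
```

This only handles presentation; grouping the bundles in the first place is the job of the principal component analysis.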
Principal Component Analysis
• 560 Lexical Bundles were reduced to 88 principal
components
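As a rough illustration of how correlated bundle rates collapse into a single component, the sketch below finds only the first principal component of a tiny, made-up rate matrix using power iteration on the covariance matrix. It is not the study's procedure: a full analysis would extract many components with a statistics package, and the data and function name are hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (loadings, share of total variance) for the first component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the columns (bundle rates).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Made-up rates: one row per document, one column per bundle.
# Columns 0 and 1 are perfectly correlated (overlapping bundles).
rates = [
    [2.0, 2.0, 5.0],
    [4.0, 4.0, 1.0],
    [6.0, 6.0, 4.0],
    [8.0, 8.0, 2.0],
]
loadings, share = first_principal_component(rates)
print([round(x, 3) for x in loadings], round(share, 3))
```

The two perfectly correlated columns receive identical loadings, which is the sense in which overlapping bundles end up in the same component.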
Component 1
principles generally accepted in
accounting principles generally accepted
generally accepted in the
accepted in the united
with accounting principles generally
affect the reported amounts
reported amounts of assets
that affect the reported
to make estimates and
factors that could cause
actual results to differ
results to differ materially
of assets and liabilities
actual results may differ
to differ materially from
differ materially from those
forward looking statements this
in the united states
allowance for doubtful accounts
are expected to be
company believes that the
Component 1
with accounting principles generally accepted in the united states
that affect the reported amounts of assets and liabilities
are expected to be
company believes that the
to make estimates and
factors that could cause
forward looking statements this
allowance for doubtful accounts
actual results to/may differ materially from those
“GAAP and expected results”
Component 2
have a material adverse
material adverse effect on
a material adverse effect
adverse effect on the
business financial condition and
could have a material
effect on the company's
can be no assurance
be no assurance that
there can be no
assurance that the company
of one or more
the company will be
no assurance that the
of the company's products
that the company will
and will continue to
Component 2
could have a material adverse effect on the company's
there can be no assurance that the company will be
business financial condition and
of one or more
of the company's products
and will continue to
“Could be bad”
Classification Results
• Discriminant Analysis
– 71% of cross-validated cases were correctly
classified
Discriminating factor (PC)                Beta
Impact and exposure                       .464
Material difference                       -.421
Common stock and adverse effects          .412
Going concerns                            .363
New product introductions                 .339
Price and offsets                         .335
COGS and change in accounting principle   .330
Fair market value                         .313
Exercise of stock options                 .298
Number of factors                         -.287
Confusion Matrix
                        Predicted Fraudulent   Predicted Non-Fraudulent
Actual Fraudulent       70                     31
Actual Non-Fraudulent   28                     73
Confusion Matrix Results
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
\[ \text{FPR} = \frac{FP}{FP + TN} \]
\[ \text{TPR} = \frac{TP}{TP + FN} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \]
Precision = .714
True Positive Rate = .693
False Positive Rate = .277
Accuracy = .708
                        Predicted Fraudulent   Predicted Non-Fraudulent
Actual Fraudulent       70 (TP)                31 (FN)
Actual Non-Fraudulent   28 (FP)                73 (TN)
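The four reported figures follow directly from the confusion matrix counts. A short check, with the function name `classification_metrics` chosen here for illustration:

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard rates computed from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }


# Counts from the confusion matrix on the slide.
metrics = classification_metrics(tp=70, fn=31, fp=28, tn=73)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```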
Conclusion
• Lexical bundles have more contextual meaning than unigrams
– Results are easier to interpret
• Lexical bundles may be used to classify documents
• Lexical bundle analysis can be used in any type of textual dataset
• This process and other text mining processes can be integrated into
continuous assurance solutions
– Rapid identification of suspicious documents

Text Mining and Continuous Assurance, Kevin Moffitt - 12th CONTECSI 34th WCARS

Pushing the limits of ePRTC: 100ns holdover for 100 days
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 

Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

  • 1. Text Mining and Continuous Assurance Kevin Moffitt
  • 2. Continuous Assurance
      • Allows for the automated and frequent review of business data
      • Current focus is on the structured data
          – General ledgers
          – Financial statements
          – XBRL
      • However, we cannot ignore the information found in unstructured data
          – Textual data, for example narrative portion of financial disclosures
      • Up to 85% of the data in financial disclosures is in the form of text
  • 3. Text Mining
      • Many methods for extracting data from text
      • One popular method is to use dictionaries/word lists
      • E.g. Dictionary to identify positive language in business documents:
        SATISFIES, PREEMINENT, REWARDED, BENEFITTING, SOLVING, COLLABORATIONS, BOOST, TREMENDOUS, GREATEST, PERFECTLY, DELIGHTING, COMPLIMENTING, EXCITING, REBOUNDED, CONCLUSIVE, ASSURE, INNOVATED, ENJOYING, CREATIVE, GREATLY
  • 4. Drawbacks of Dictionary Method
      • Single words
          – Context Free
          – Naïve
  • 5. Lexical Bundles
      • Frequent multi-word sequences in a given corpus (e.g. financial reports, history journals, biology journals)
      • More context in phrases than in individual words
      • Criteria for identifying lexical bundles
          – Sequences of four or more words
          – Occur in at least 15% of unique documents
          – Occur at a rate of at least 20 times per million words
      • Example lexical bundles from annual reports
          – the fair value of
          – be adversely affected by
          – as a percentage of
          – assets and liabilities and
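The three criteria on slide 5 can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `extract_lexical_bundles`, its parameters, and the whitespace tokenizer are all assumptions made for the example, and it only looks at sequences of exactly `n` words (the slide allows four or more).

```python
from collections import Counter


def extract_lexical_bundles(documents, n=4, min_doc_share=0.15,
                            min_rate_per_million=20.0):
    """Return {bundle: rate per million words} for n-grams passing both thresholds.

    Hypothetical helper: tokenization is a plain lowercase whitespace split,
    which is an assumption; the slides do not describe the preprocessing.
    """
    total_words = 0
    occurrences = Counter()  # total occurrences across the whole corpus
    doc_counts = Counter()   # number of distinct documents containing the n-gram

    for text in documents:
        tokens = text.lower().split()
        total_words += len(tokens)
        ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        occurrences.update(ngrams)
        doc_counts.update(set(ngrams))  # count each document at most once

    bundles = {}
    for ngram, count in occurrences.items():
        rate = count * 1_000_000 / total_words
        share = doc_counts[ngram] / len(documents)
        if share >= min_doc_share and rate >= min_rate_per_million:
            bundles[ngram] = rate
    return bundles


docs = [
    "the fair value of the asset",
    "we measure the fair value of debt",
    "nothing relevant here at all today",
]
# A stricter document threshold than the slide's 15% is used here
# only because the toy corpus has three documents.
bundles = extract_lexical_bundles(docs, n=4, min_doc_share=0.5)
```

On this toy corpus only "the fair value of" appears in at least half of the documents, so it is the only sequence kept.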
  • 6. Lexical Bundles
      • Research objective: use lexical bundles to discriminate between fraudulent and non-fraudulent financial reports
  • 7. Research Questions
      • RQ1: What are the most frequently used lexical bundles in fraudulent and non-fraudulent Management Discussion and Analysis section (MD&A) of annual reports?
      • RQ2: Which lexical bundles are used at a considerably different rate in fraudulent and non-fraudulent MD&As?
      • RQ3: Can lexical bundles be used to classify fraudulent and non-fraudulent MD&As at a rate greater than chance?
  • 8. Sample Selection
      • Identified 101 fraudulent annual reports (10-Ks) from set of SEC investigations
      • Analyzed the Management Discussion and Analysis (MD&A) section of 10-K
          – Gives investors view of company from management’s perspective
          – Contains some of the least structured language in the 10-K
          – Most read part of 10-K
  • 9. Sample Selection
      Sample selection criteria for fraudulent 10-Ks

      Criterion                                                                     Count
      Companies identified as fraudulent by searching through AAERs                 141
      Disqualified because fraud did not involve 10-Ks                              (20)
      Disqualified because 10-K was not available from the EDGAR DB                 (10)
      Disqualified because 10-K did not contain management discussion section       (10)
      Final count of qualifying fraudulent 10-Ks used in the sample                 101
  • 10. Sample Selection: Types of Fraud

      Type of fraud                                                     Companies
      Overstatement of revenues                                         44
      Combination of overstating revenue and understating expenses      25
      Disclosure issue                                                  10
      Overstatement of inventory                                        6
      Other income increasing effects                                   6
      Understatement of provisions for loan-loss reserves               5
      Other                                                             5
  • 11. Sample Selection: Non-Fraudulent Sample
      • 101 matching non-fraudulent 10-Ks were identified
  • 12. Results
  • 13. Lexical Bundle Identification
      • 560 lexical bundles were identified
  • 14. Creative Accounting

      Lexical bundle                            Fraud (per million words)   Non-fraud (per million words)   % difference
      in process research and development       199                         76                              160%
      goodwill and other intangible assets      121                         82                              47%
  • 15. Big Bath Charges
      • Wholesale aggressive restructuring to improve cost and expense structure for the future
          – Disposition of long-lived assets

      Lexical bundle                            Fraud (per million words)   Non-fraud (per million words)   % difference
      disposition of long lived assets and      49                          21                              139%
  • 16. Fair Value Accounting
      • Subjective method for assigning value to an asset
          – Change value of assets
          – Understate debt obligations
          – Misrepresent foreign currency exchange adjustments

      Lexical bundle                            Fraud (per million words)   Non-fraud (per million words)   % difference
      the fair value of                         257                         171                             50%
      in foreign currency exchange              41                          21                              97%
  • 17. Lexical Bundles Used More Frequently in Non-Fraudulent MD&As
      • Conservative language for accounting practices

      Lexical bundle                            Fraud (per million words)   Non-fraud (per million words)   % difference
      to continue as a going concern            15                          91                              513%
      disclosures about market risk             85                          115                             36%
      material impact on the                    38                          52                              35%
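The tables on slides 14 to 17 report bundle rates per million words and a percentage difference between the two samples. The slides do not state how the percentage was computed; the sketch below assumes it is the gap between the two rates relative to the lower rate, which roughly reproduces the reported figures (small deviations would be expected if the slides round the underlying rates). Both function names are hypothetical.

```python
def rate_per_million(count, total_words):
    """Occurrences of a bundle per million words of text."""
    return count * 1_000_000 / total_words


def pct_difference(fraud_rate, nonfraud_rate):
    """Gap between the two rates as a percentage of the lower rate.

    Assumption: this formula is inferred from the slide tables,
    not stated by the author.
    """
    low, high = sorted((fraud_rate, nonfraud_rate))
    return (high - low) / low * 100


# "the fair value of": 257 (fraud) vs 171 (non-fraud) per million words
fair_value_gap = pct_difference(257, 171)
```

For "the fair value of" this gives about 50%, in line with slide 16.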
  • 18. Principal Component Analysis
      • Variable reduction procedure
          – Combines correlated variables into principal components
      • Principal components
          – First component accounts for maximum amount of total variance in the observed variables
          – Components are uncorrelated
      • Components are made up of correlated variables
          – Overlapping lexical bundles are combined
      • Correlated bundles transformed into one principal component (4-word bundles → 6-word component)
          – there can be no
          – there can be no assurance
          – can be no assurance
          – there can be no assurance that
          – can be no assurance that
          – be no assurance that
  • 19. Principal Component Analysis
      • 560 lexical bundles were reduced to 88 principal components
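To make the idea on slides 18 and 19 concrete, here is a minimal pure-Python sketch that finds only the first principal component by power iteration on the covariance matrix. It is an illustration under assumptions, not the procedure used in the study: the input layout (one row per document, one column per bundle rate), the function name, and the toy numbers are all invented for the example, and a real analysis would use a statistics package and extract many components.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, share of total variance) of the first component.

    rows: list of observations, each a list of feature values.
    Limitation: the all-ones start vector fails if it happens to be
    orthogonal to the leading component.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]

    # Rayleigh quotient gives the variance captured along v
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Toy data: columns 0 and 1 behave like two overlapping bundles
# (always used together); column 2 is an unrelated bundle.
rates = [[1, 1, 0], [2, 2, 1], [3, 3, 0], [4, 4, 1]]
direction, share = first_principal_component(rates)
```

In the toy data the two perfectly correlated columns receive equal loadings and dominate the first component, which mirrors how overlapping bundles end up in the same component.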
  • 20. Component 1
      principles generally accepted in; accounting principles generally accepted; generally accepted in the; accepted in the united; with accounting principles generally; affect the reported amounts; reported amounts of assets; that affect the reported; to make estimates and; factors that could cause; actual results to differ; results to differ materially; of assets and liabilities; actual results may differ; to differ materially from; differ materially from those; forward looking statements this; in the united states; allowance for doubtful accounts; are expected to be; company believes that the
  • 21. Component 1: “GAAP and expected results”
      with accounting principles generally accepted in the united states; that affect the reported amounts of assets and liabilities; are expected to be; company believes that the; to make estimates and; factors that could cause; forward looking statements this; allowance for doubtful accounts; actual results to actual results may differ materially from those
  • 22. Component 2
      have a material adverse; material adverse effect on; a material adverse effect; adverse effect on the; business financial condition and; could have a material; effect on the company's; can be no assurance; be no assurance that; there can be no; assurance that the company; of one or more; the company will be; no assurance that the; of the company's products; that the company will; and will continue to
  • 23. Component 2: “Could be bad”
      could have a material adverse effect on the company's; there can be no assurance that the company will be; business financial condition and; of one or more; of the company's products; and will continue to
  • 24. Classification Results
      • Discriminant Analysis
          – 71% of cross-validated cases were correctly classified

      Discriminating factor (PC)                    Beta
      Impact and exposure                           .464
      Material difference                           -.421
      Common stock and adverse affects              .412
      Going concerns                                .363
      New product introductions                     .339
      Price and offsets                             .335
      COGS and change in accounting principle       .330
      Fair market value                             .313
      Exercise of stock Options                     .298
      Number of Factors                             -.287
  • 25. Confusion Matrix

                                    Predicted: Fraudulent   Predicted: Non-Fraudulent
      Actual: Fraudulent            70                      31
      Actual: Non-Fraudulent        28                      73
  • 26. Confusion Matrix Results

      $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
      $\text{FPR} = \frac{FP}{FP + TN}$
      $\text{TPR} = \frac{TP}{TP + FN}$
      $\text{Precision} = \frac{TP}{TP + FP}$

      Precision = .714
      True Positive Rate = .693
      False Positive Rate = .277
      Accuracy = .708

                                    Predicted: Fraudulent   Predicted: Non-Fraudulent
      Actual: Fraudulent            70 (TP)                 31 (FN)
      Actual: Non-Fraudulent        28 (FP)                 73 (TN)
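The four figures on slide 26 follow directly from the confusion matrix counts. The short sketch below recomputes them from the standard definitions; the function name `confusion_metrics` and the dictionary keys are chosen for this example only.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Counts from the slide: 70 TP, 31 FN, 28 FP, 73 TN
metrics = confusion_metrics(tp=70, fn=31, fp=28, tn=73)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, the results agree with the values reported on the slide (.714, .693, .277, .708).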
  • 27. Conclusion
      • Lexical bundles have more contextual meaning than unigrams
          – Results are easier to interpret
      • Lexical bundles may be used to classify documents
      • Lexical bundle analysis can be used in any type of textual dataset
      • This process and other text mining processes can be integrated into continuous assurance solutions
          – Rapid identification of suspicious documents