1
DEEP LEARNING FOR
UNDERSTANDING OF
UNSTRUCTURED
DOCUMENTS
MAY 27, 2017
2
DMITRII NIKITKO
I have worked at EPAM's big data competence center for approximately one year and
have been involved in data science and machine learning for at least two years.
During this time I have worked on many data science projects and competitions
and have gained strong production experience in this area.
Dmitrii Nikitko
EPAM Systems, Data Analyst
3
WHAT IS DOCUMENT UNDERSTANDING?
Printed document understanding is the
automatic extraction of information.
Invoice number   Invoice date   Company
INV0010          16.02.2012     Electrical
I1083            08.14.2008     Software
Challenges:
• Different layouts, languages, vendors
• Various schemas (sets of typed items)
4
Project challenges
• Invalid documents (bad quality, non-invoice)
• Multi page documents
• Mistakes in database
• Only open source software can be used
Example documents in a sample
5
Proposed solution
Invoice: INV24816
Date: 21.10.2015
Company: Vector
Pipeline features:
• Uses open source software
• Dockerized
• Modular architecture
• Trainer and evaluator included
• Interactive debugging
Pipeline components: PDFBox, Tesseract OCR, TensorFlow, Keras
Solution features:
• Works with different layouts, languages, vendors
• Can be easily extended to work with new fields
• End-to-end processing
• Gives a probability of recognition error
• High accuracy
6
Annotating documents
Annotating documents: creating labeled
bounding boxes over the target fields.
Total processed docs: 14000
People involved: 20
Time spent: 3 months
Processing steps:
1. Sample documents from various vendors
2. Process documents via annotators
3. Validate processed documents
Modified tool for document labelling
7
Possible approaches
Hand-written rules
Pros:
• Interpretable decisions
• Usually faster
Cons:
• Domain knowledge is required
• Time-consuming
• Quite difficult to maintain
Statistical models
Pros:
• Features are extracted automatically
• Robust to unfamiliar or erroneous input
• Usually higher quality
Cons:
• Labelled dataset is required
• ML specialist is required
8
Classical statistical approach
1. Recognize the document and split it into text blocks
2. Associate the document with a schema
3. Apply the relevant classifier to each text block
4. Select the block with the highest score
Schema 1 classifier scores:
Text block    Score
Total due:    0.0
$14.900       0.9
Tax:          0.0
$100          0.1
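Step 4 of the classical approach can be sketched in a few lines. This is only an illustration: the function name `select_best_block` is hypothetical, and the scores are copied from the table on this slide rather than produced by a real classifier.

```python
def select_best_block(scored_blocks):
    """Return the (text, score) pair with the highest classifier score."""
    if not scored_blocks:
        raise ValueError("no text blocks to choose from")
    return max(scored_blocks, key=lambda pair: pair[1])


# Scores taken from the slide's table; a real pipeline would get them
# from the classifier associated with the document's schema.
scored_blocks = [
    ("Total due:", 0.0),
    ("$14.900", 0.9),
    ("Tax:", 0.0),
    ("$100", 0.1),
]

best_text, best_score = select_best_block(scored_blocks)
print(best_text, best_score)
```

The selected score can also serve as a rough confidence value for the extracted field.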
9
End-to-end deep neural network
Invoice: INV24816
Date: 21.10.2015
Company: Vector
10
Recurrent neural networks for beginners
• Fully connected neural network: stateless
• Simple recurrent neural network: short dependencies
• Long short-term memory (LSTM): long dependencies
11
Simple RNN vs Minimal Gated Unit
One-hot encoded input time-steps:
     X1   X2   X3   X4   X5
     H    e    l    l    o
A    0    0    0    0    0
B    0    0    0    0    0
C    0    0    0    0    0
D    0    0    0    0    0
E    0    1    0    0    0
…    0    0    0    0    0
H    1    0    0    0    0
…    0    0    0    0    0
L    0    0    1    1    0
Figure: simple unrolled recurrent neural network with its hidden state
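The one-hot table and the unrolled network can be illustrated with a small pure-Python sketch. It covers only the simple RNN, not the Minimal Gated Unit. The alphabet, the hidden size, the random weights, and the update rule h_t = tanh(W_xh x_t + W_hh h_(t-1)) are assumptions chosen for illustration; the deck does not give the actual model parameters.

```python
import math
import random
import string

# Assumption: a 26-letter alphabet, matching the A..L rows of the table.
ALPHABET = string.ascii_uppercase


def one_hot(char):
    """Encode one character as a vector with a single 1."""
    vec = [0] * len(ALPHABET)
    vec[ALPHABET.index(char.upper())] = 1
    return vec


def rnn_step(x, h_prev, w_xh, w_hh):
    """One time-step of a simple RNN: tanh(W_xh x + W_hh h_prev)."""
    hidden = []
    for j in range(len(h_prev)):
        total = sum(w_xh[j][i] * x[i] for i in range(len(x)))
        total += sum(w_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        hidden.append(math.tanh(total))
    return hidden


def run_rnn(text, hidden_size=4, seed=0):
    """Unroll the RNN over the text and return the hidden state per step."""
    rng = random.Random(seed)
    w_xh = [[rng.uniform(-0.5, 0.5) for _ in ALPHABET]
            for _ in range(hidden_size)]
    w_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
            for _ in range(hidden_size)]
    h = [0.0] * hidden_size  # initial hidden state
    states = []
    for ch in text:
        h = rnn_step(one_hot(ch), h, w_xh, w_hh)
        states.append(h)
    return states


inputs = [one_hot(c) for c in "Hello"]
states = run_rnn("Hello")
print(len(inputs), len(states))
```

The two "l" time-steps receive identical input vectors, but the hidden state carried over from the previous step generally differs, which is how the network keeps track of context.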
12
RNN: Point of view
The neuron highlighted in this image seems to get very excited about URLs and turns off outside of the
URLs. The LSTM is likely using this neuron to remember if it is inside a URL or not.
The Unreasonable Effectiveness of Recurrent Neural Networks
13
Sequence to sequence approach
The model takes an input sequence (the document content,
character by character) and outputs another
sequence (also character by character).
During training, the encoder learns
patterns in the input sequence.
The decoder is trained to transform the encoder's hidden state
into the value corresponding to the ground truth.
Example:
Input sequence:
Vector incorporated. Buy date: 11-20-2016.
Invoice data: 11-31-2016 …more info…
Neural network output:
31.11.16
14
Sequence to mask approach
The model takes an input sequence (the document content, character by character) and outputs
a probability for each character of the input sequence.
Example:
Input sequence:
Vector inc. Tax amount: $26. Total amount: $146.22. Total Due:3..
Neural network output:
0000000000000000000000000000000000000000111110000000000000
Architecture:
Bidirectional RNN -> Bidirectional RNN -> time-distributed FC with sigmoid -> mask
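The mask idea can be sketched without any neural network code: build a 0/1 target mask from a known field value for training, and turn per-character probabilities back into text at prediction time. Both helpers below (`build_mask`, `extract_span`) and the 0.5 threshold are hypothetical; the deck does not describe how masks were actually built or decoded.

```python
def build_mask(text, value):
    """Training target: 1 for characters of the first occurrence of value."""
    mask = [0] * len(text)
    start = text.find(value)
    if start >= 0:
        for i in range(start, start + len(value)):
            mask[i] = 1
    return mask


def extract_span(text, probs, threshold=0.5):
    """Return the longest contiguous run with probability >= threshold."""
    best_start, best_len = 0, 0
    run_start = None
    # The trailing sentinel closes a run that reaches the end of the text.
    for i, p in enumerate(list(probs) + [float("-inf")]):
        if p >= threshold and run_start is None:
            run_start = i
        elif p < threshold and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return text[best_start:best_start + best_len]


text = "Vector inc. Tax amount: $26. Total amount: $146.22. Total Due:3.."
mask = build_mask(text, "$146.22")
# Here the perfect mask stands in for the network's sigmoid outputs.
print(extract_span(text, mask))
```

Taking the longest run is one simple way to ignore isolated characters that cross the threshold by accident; other decoding rules are possible.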
15
Errors funnel on test set
Stage             Errors   Remaining
(input)                    100 %
PDF converter     3%       97 %
OCR               8%       89 %
Model             8%       81 %
Post-processing   2%       79 %
Errors:
• Model chose an incorrect value
• Values in the output sequence are too low
• Incorrect text from OCR
• OCR did not recognize the text block at all
• Incorrect post-processing
• Incorrect choice from multi-page docs
• Error in converting the PDF
• Corrupted output image
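The funnel figures can be checked with simple arithmetic. The sketch below assumes that the four percentages belong to the four pipeline stages in the order they are listed, and that each one is a share of the whole test set; under that reading the remaining values come out as 97, 89, 81 and 79.

```python
# Assumption: losses are paired with stages in the order listed on the slide.
stage_losses = [
    ("PDF converter", 3),
    ("OCR", 8),
    ("Model", 8),
    ("Post-processing", 2),
]


def funnel(losses, start=100):
    """Return (stage, loss, remaining) rows, subtracting each stage's loss."""
    remaining = start
    rows = [("Input", 0, remaining)]
    for stage, loss in losses:
        remaining -= loss
        rows.append((stage, loss, remaining))
    return rows


rows = funnel(stage_losses)
for stage, loss, remaining in rows:
    print(f"{stage:<16} -{loss:>2} -> {remaining:>3} %")
```

Read this way, OCR and the model each account for 8 of the 21 points lost end to end.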
16
Problems of development and delivery
• Customer insight
• Selecting metrics
• Reproducible experiments
• Co-workers activity
• Alternative approaches
• Bugs
• Quality of data
Rules of machine learning - best practices in ML from Google
Sacred is a tool to configure, organize, log and
reproduce computational experiments
17
THANK YOU
QUESTIONS?
Dmitrii_Nikitko@epam.com

#ITsubbotnik Spring 2017: Dmitrii Nikitko "Deep learning for understanding of unstructured documents"

  • 1. 1 DEEP LEARNING FOR UNDERSTANDING OF UNSTRUCTURED DOCUMENTS MAY 27, 2017
  • 2. 2 DMITRII NIKITKO I work for EPAM`s big data competence center approximately an one year and at least two years I involved into data science and machine learning activity. From this time forward I work for many data science projects and competitions and have strong production experience in this area. Dmitrii Nikitko EPAM Systems, Data Analyst
  • 3. 3 WHAT IS DOCUMENT UNDERSTANDING ? Printed document understanding - automatic extraction of information Invoice number Invoice date Company INV0010 16.02.2012 Electrical I1083 08.14.2008 Software Challenges: • Different layouts, languages, vendors • Various schema (set of typed items)
  • 4. Project challenges • Invalid documents (bad quality, non-invoice) • Multi-page documents • Mistakes in the database • Only open source software can be used. Figure: example documents in a sample
  • 5. Proposed solution. Example output: Invoice: INV24816, Date: 21.10.2015, Company: Vector. Pipeline features: • Uses open source software • Dockerized • Modular architecture • Trainer and evaluator included • Interactive debugging. Components: PDFBox, Tesseract OCR, TensorFlow, Keras. Solution features: • Works with different layouts, languages, vendors • Can be easily extended to work with new fields • End-to-end processing • Reports a probability of recognition error • High accuracy
  • 6. Annotating documents. Annotating a document means creating labeled bounding boxes over the target fields. Total processed docs: 14000. People involved: 20. Time spent: 3 months. Processing steps: 1. Sample documents from various vendors 2. Process documents via annotators 3. Validate processed documents. Figure: modified tool for document labelling
  • 7. Possible approaches. Hand-written rules. Pros: • Interpretable decisions • Usually faster. Cons: • Domain knowledge is required • Time-consuming • Quite difficult to maintain. Statistical models. Pros: • Features are extracted automatically • Robust to unfamiliar or erroneous input • Usually higher quality. Cons: • Labelled dataset is required • ML specialist is required
  • 8. Classical statistical approach 1. Recognize the document and split it into text blocks 2. Associate the document with a schema 3. Apply the schema's classifier to each text block 4. Select the block with the highest score. Schema 1 classifier scores (text block: score): "Total due:": 0.0; "$14.900": 0.9; "Tax:": 0.0; "$100": 0.1
  • 9. End-to-end deep neural network. Example output: Invoice: INV24816, Date: 21.10.2015, Company: Vector
  • 10. Recurrent neural networks for beginners • Fully connected neural network: stateless • Simple recurrent neural network: short dependencies • Long short-term memory: long dependencies
  • 11. Simple RNN vs Minimal Gated Unit. Input time-steps X1 to X5 carry the characters H, e, l, l, o, one-hot encoded: row H is 1 at X1, row E is 1 at X2, row L is 1 at X3 and X4, and the other rows shown (A, B, C, D) are all 0. Figure labels: simple unrolled recurrent neural network; hidden state
  • 12. RNN: Point of view. "The neuron highlighted in this image seems to get very excited about URLs and turns off outside of the URLs. The LSTM is likely using this neuron to remember if it is inside a URL or not." Source: The Unreasonable Effectiveness of Recurrent Neural Networks
  • 13. Sequence to sequence approach. It takes an input sequence (document content, character by character) and outputs another sequence (character by character). During training, the encoder learns patterns in the input sequence, and the decoder learns to transform the encoder's hidden state into the value corresponding to the ground truth. Example: Input sequence: Vector incorporated. Buy date: 11-20-2016. Invoice data: 11-31-2016 …more info… Neural network output: 31.11.16
  • 14. Sequence to mask approach. It takes an input sequence (document content, character by character) and outputs a probability for each position of the input sequence. Architecture: two bidirectional RNN layers and a time-distributed fully connected layer with sigmoid, which produces the mask. Example: Input sequence: Vector inc. Tax amount: $26. Total amount: $146.22. Total Due:3.. Neural network output: 0000000000000000000000000000000000000000111110000000000000
  • 15. Errors funnel on test set. Pipeline stages: PDF converter, OCR, Model, Post-processing. Share of documents still correct along the funnel: 100%, 97%, 89%, 81%, 79% (losses of 3%, 8%, 8%, 2%). Errors: • Model chose an incorrect value • Values in the output sequence are too low • Incorrect text from OCR • OCR did not recognize the text block at all • Incorrect post-processing • Incorrect choice among pages of multi-page docs • Error in converting PDF • Corrupted output image
  • 16. Problems of development and delivery: customer insight, selecting metrics, reproducible experiments, co-workers' activity, alternative approaches, bugs, quality of data. Resources: Rules of Machine Learning (best practices in ML from Google); Sacred, a tool to configure, organize, log and reproduce computational experiments
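
The sequence to mask approach on slide 14 yields one probability per input character, so the field value still has to be cut out of the input text. The slides do not show how this is done. The sketch below is one plausible decoding rule, offered as an assumption: threshold the probabilities and keep the longest contiguous run. The function names, the 0.5 threshold, and the probabilities are all hypothetical and are not model output.

```python
from typing import List, Tuple


def longest_run_above(probs: List[float], threshold: float = 0.5) -> Tuple[int, int]:
    """Return (start, end) of the longest run with prob >= threshold (end exclusive).

    Returns (0, 0) when no position reaches the threshold.
    Assumes 0 < threshold <= 1 so the sentinel below always closes a run.
    """
    best = (0, 0)
    start = None
    # A trailing sentinel below any valid threshold closes a run that
    # reaches the end of the sequence.
    for i, p in enumerate(list(probs) + [-1.0]):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def extract_value(text: str, probs: List[float], threshold: float = 0.5) -> str:
    """Cut the masked span out of the input text (hypothetical decoding rule)."""
    if len(text) != len(probs):
        raise ValueError("need exactly one probability per input character")
    start, end = longest_run_above(probs, threshold)
    return text[start:end]


# Made-up probabilities: high over the target amount, low elsewhere,
# plus one isolated false positive on the "$" of "$26".
text = "Tax amount: $26. Total amount: $146.22."
target = "$146.22"
begin = text.index(target)
probs = [0.9 if begin <= i < begin + len(target) else 0.1 for i in range(len(text))]
probs[text.index("$26")] = 0.7

value = extract_value(text, probs)
print(value)
```

Keeping only the longest run discards isolated false positives. The per-character probabilities inside the chosen run could also be aggregated into a confidence score, in the spirit of the "probability of recognition error" listed among the solution features.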

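The errors funnel on slide 15 is plain cumulative arithmetic, reproduced in the short sketch below. It assumes that the losses of 3%, 8%, 8% and 2% are percentage points of the whole test set and that they map to the stages in pipeline order (PDF converter, OCR, Model, Post-processing). The slide text does not state this mapping, so the stage labels are an assumption. The function name `funnel` is hypothetical.

```python
from typing import List, Tuple

# (stage, loss in percentage points of the whole test set).
# The stage-to-loss mapping is assumed, not taken from the slide.
STAGE_LOSSES: List[Tuple[str, int]] = [
    ("PDF converter", 3),
    ("OCR", 8),
    ("Model", 8),
    ("Post-processing", 2),
]


def funnel(stage_losses: List[Tuple[str, int]], start: int = 100) -> List[Tuple[str, int, int]]:
    """Return (stage, loss, remaining) rows, subtracting each loss in turn."""
    remaining = start
    rows = []
    for stage, loss in stage_losses:
        remaining -= loss
        rows.append((stage, loss, remaining))
    return rows


rows = funnel(STAGE_LOSSES)
for stage, loss, remaining in rows:
    print(f"{stage:<16} -{loss:>2} pp  remaining {remaining} %")
```

Under these assumptions the remaining shares match the slide's 97%, 89%, 81% and 79%, which gives an end-to-end accuracy of 79% on the test set.
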
Editor's Notes

  1. Specific Video Links:
     Corporate Presentations:
     - DPES position (a bit dated, but this material landed us in the lead category for the Forrester Wave): https://videoportal.epam.com/video/LoEpvyad
     - Global Locations (Elaina Shekhter & Eli Feldman): https://videoportal.epam.com/video/noZmMgaq
     - Data Intelligence Practice (Max Bogretsov): https://videoportal.epam.com/video/7oWwGGoP
     - Digital Services (Kevin Labick): https://videoportal.epam.com/video/2RdLp8Rd
     Industry Presentations:
     - Energy Practice (Alan Harlan, John Chu): https://videoportal.epam.com/video/qay0ByJl
     - Healthcare (Alan Harlan, Daniel Estrada, Jacob Jesson, Yuriy Gankin): https://videoportal.epam.com/video/9o36K2al
     - Digital in Retail (Kevin Labick, Valeri Makovic, Dan Kjaergaard, Pedro Silva, & Robin Smith): https://videoportal.epam.com/video/BR0XWbRl
     Focus Areas/Products:
     - IoT (Eli Feldman): https://videoportal.epam.com/video/nR49ZzJD & https://videoportal.epam.com/video/nR4X2ARD + https://videoportal.epam.com/video/BR0X1bRl
     - API Strategy (Max & Pavel Veller): https://videoportal.epam.com/video/xazMdLJG
     - Cloud Services (Eli Feldman): https://videoportal.epam.com/video/7oWlpvRP
     - Blockchain (FB1, Ilya Romanov, Roman Polupanov): https://videoportal.epam.com/video/Koe9n7oq
     - DevOps (Eli & Dmitry Tikhomirov): https://videoportal.epam.com/video/LoBm2ba9
     - InfoNgen (Serge Velychko): https://videoportal.epam.com/video/PaK27VJ0