Internal
Ease the pain of Travel Refunds with AI
Generali Group - Analytics Solutions Center
Belgrade, 18th September 2018
Context
When I say “travel expense”, what do you think about?
Challenge
8000 employees
600.000 hours of employee time
+ 600.000 hours of HR lost per year
…and a lot of frustration
Divide the process time by 10
Solution - Create an automatic travel expenses classifier integrated with a smartphone bot
Components: Receipts from business trips, Telegram Bot, OCR, RefundAnalytics, HR System
HR System
Category*: Restaurant
Amount: 42.75
Date: 27/02/2018
Currency: Euro
*Categories: Restaurant, Transport (except taxi), Taxi, City tax, Mixed type (Hotel: City tax + Restaurant)
OCR process – Test available tools
Metric | Tesseract | Google API | Microsoft API
Amount extracted | 85.2% | 92.7% | 64.7%
Date extracted | 65.5% | 86.3% | 42.6%
Tesseract: Pre-processing – psm, rotate, resize, denoising, removing lines, contrast
Google: Sending images to Google servers
5sec 1.8sec
From OCR to meaningful words
Flatten the boxes to line-by-line text
Replace misspellings and lemmatize
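The two steps above can be sketched as follows. This is a minimal illustration, not the team's implementation: the box format (`text`, `x`, `y`), the `line_tolerance` value, the small `VOCABULARY` list, and the similarity `cutoff` are all assumptions. Lemmatization is left out because it needs a language-specific resource.

```python
from difflib import get_close_matches

def flatten_boxes(boxes, line_tolerance=10):
    """Group word boxes into text lines by vertical position, then sort left to right."""
    lines = []  # each entry: [reference_y, [boxes on that line]]
    for box in sorted(boxes, key=lambda b: b["y"]):
        if lines and abs(box["y"] - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(box)
        else:
            lines.append([box["y"], [box]])
    return [
        " ".join(b["text"] for b in sorted(row, key=lambda b: b["x"]))
        for _, row in lines
    ]

def fix_misspellings(line, vocabulary, cutoff=0.8):
    """Replace each token by its closest vocabulary word, if one is similar enough."""
    fixed = []
    for token in line.split():
        match = get_close_matches(token.upper(), vocabulary, n=1, cutoff=cutoff)
        fixed.append(match[0] if match else token)
    return " ".join(fixed)

# Hypothetical OCR output: one dict per detected word box (pixel coordinates).
boxes = [
    {"text": "EURO", "x": 120, "y": 203},
    {"text": "TOTALE", "x": 10, "y": 200},
    {"text": "42,75", "x": 200, "y": 198},
    {"text": "CONTANTl", "x": 10, "y": 230},  # OCR confused "I" with "l"
    {"text": "42,75", "x": 200, "y": 232},
]
VOCABULARY = ["TOTALE", "CONTANTI", "EURO", "SCONTO"]  # assumed word list

lines = [fix_misspellings(line, VOCABULARY) for line in flatten_boxes(boxes)]
print(lines)
```

Numeric tokens such as `42,75` have no close vocabulary match and pass through unchanged.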
Predicting the category of the receipt
Taxi
Transport (except taxi)
Restaurant
City tax
Mixed type (city tax + meal)
FONTANA SRL
VIA DEI CROCIFERI 12-13
00187 ROMA
P.IVA 08656441006
TEL 06.69925682
EURO
5 X 1,50
7,50
PANE E_O SERVIZI 2,50
ACQUA NATURALE
ACQUA MINERALELL 2,50
4 X 3,50
FIORI ZUCCA 14,00
4 X 11,00
TONNAR CAC_PEPE 44,00
RIGAT AMATRICIA 11,00
3 X 1,00
CAFFE CONV 3,00
CAFFE CONV 1,00
SUB-TOTALE 85,50
SCONTO -42,75
#TAV 75 – SHARON
TOTALE EURO 42.75 42,13
CONTANTI 42,75
NR.0012
27/02/18 14:11
MU1 72017007 PP 3,55
TF-IDF input → Random Forest
Category: Restaurant
Category probability: 98%
Use TF-IDF features to build category model from text
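To make the TF-IDF step concrete, here is a small pure-Python sketch. The weighting formula (term frequency times log(N / document frequency)) is one common variant and an assumption here; libraries usually apply smoothing. The three example texts are made up. A Random Forest, as named on the slide, would then be trained on such vectors; that part is not shown.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Made-up receipt texts, for illustration only.
receipts = [
    "caffe acqua totale euro",     # restaurant-like text
    "taxi milano totale euro",     # taxi-like text
    "caffe pizzeria totale euro",  # restaurant-like text
]
vectors = tfidf(receipts)
print(vectors[1])
```

Words that appear on every receipt ("totale", "euro") get weight zero, while a word such as "taxi" that appears on only one receipt gets the highest weight.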
Let’s play a game.
Can you guess the category of this receipt based on the text extracted?
Which category?
Dal 1960 al vostro servizio
a ricerca satellitare
Milano PAGATA CON02.8585 a ricerca satellitare
Milano 8/3 67
Credit Card
Signature Firma OS. € From/to Globix cartaceo
Abbonamento Ranamento Dala Anali
RISTORANTE PIZZERIA CAFFE MIE
Esente IVA art. 10 N. 14 del 26.10.1972 N. 633
Contante
It’s a taxi receipt
Use TF-IDF features and add image features from an Xception network to the model
Model inputs: Xception input, TF-IDF input
This was easy.
Model accuracy: 98%
Predict the amount
Predict the amount – From text to amounts dataframe
Amount | True amount (target variable) | Amount rank (words / lines) | EOL before / after / around | Is this amount a decimal number? | x words around the amount (embeddings)
Amount 1 | No | 0.1 | 0 | 0 | …
Amount 2 | Yes | 0.2 | 1 | 1 | …
… | … | … | … | … | …
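A rough sketch of how receipt text could be turned into one row per candidate amount. The feature names (`line_rank`, `end_of_line`, `is_decimal`, `words_before`), the token pattern, and the window size are assumptions chosen to mirror the columns above. The receipt excerpt is simplified, and the conversion of the surrounding words to embeddings is not shown.

```python
import re

# Assumed pattern: optional minus sign, digits, optional 1-2 decimals after "," or ".".
AMOUNT_TOKEN = re.compile(r"^-?\d+(?:[.,]\d{1,2})?$")

def amount_candidates(text, true_amount=None, window=2):
    """Build one feature row per numeric token found in the receipt text."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    rows = []
    for line_no, tokens in enumerate(lines):
        for pos, token in enumerate(tokens):
            if not AMOUNT_TOKEN.match(token):
                continue
            value = float(token.replace(",", "."))
            rows.append({
                "value": value,
                # Relative position of the line within the receipt (0 = top, 1 = bottom).
                "line_rank": round(line_no / (len(lines) - 1), 2) if len(lines) > 1 else 0.0,
                "end_of_line": int(pos == len(tokens) - 1),
                "is_decimal": int("," in token or "." in token),
                "words_before": tokens[max(0, pos - window):pos],
                # Target variable; None when no label is available.
                "is_total": None if true_amount is None else int(value == true_amount),
            })
    return rows

receipt = """SUB-TOTALE 85,50
SCONTO -42,75
TOTALE EURO 42,75
CONTANTI 42,75"""

rows = amount_candidates(receipt, true_amount=42.75)
for row in rows:
    print(row)
```

In this excerpt two rows carry the true amount (the total and the cash line), which is one reason accuracy per amount and accuracy per receipt can differ.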
Predict the amount – Keras model on amounts dataframe
Inputs: Word embeddings (Input Layer), Other features (Additional Input Layer)
Layers: Flatten, Dropout, Concatenate, Dense Layers, Dropout
Loss = binary crossentropy (Total amount / Not Total amount)
Model accuracy: 94%
Per receipt accuracy: 86%
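The gap between model accuracy and per receipt accuracy can be illustrated with a small sketch. How the talk computed the per receipt figure is not stated; the definition below (the highest-probability candidate on each receipt must be the true total) is an assumption, and the scores are made up.

```python
from collections import defaultdict

def per_amount_accuracy(rows, threshold=0.5):
    """Share of candidate amounts whose thresholded prediction matches the label."""
    correct = sum((r["prob"] >= threshold) == bool(r["is_total"]) for r in rows)
    return correct / len(rows)

def per_receipt_accuracy(rows):
    """Share of receipts where the highest-probability candidate is the true total."""
    by_receipt = defaultdict(list)
    for r in rows:
        by_receipt[r["receipt"]].append(r)
    hits = sum(
        max(candidates, key=lambda r: r["prob"])["is_total"]
        for candidates in by_receipt.values()
    )
    return hits / len(by_receipt)

# Made-up scores for two receipts, for illustration only.
rows = [
    {"receipt": "A", "prob": 0.10, "is_total": 0},
    {"receipt": "A", "prob": 0.20, "is_total": 0},
    {"receipt": "A", "prob": 0.90, "is_total": 1},
    {"receipt": "B", "prob": 0.05, "is_total": 0},
    {"receipt": "B", "prob": 0.40, "is_total": 0},
    {"receipt": "B", "prob": 0.30, "is_total": 1},
]
print(per_amount_accuracy(rows), per_receipt_accuracy(rows))
```

Here five of six candidate amounts are classified correctly, yet only one of the two receipts gets the right total, because a single wrong candidate is enough to spoil a receipt.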
Predict the amount – Keras model on receipts dataframe
Inputs: Word embeddings, Other features
Layers: Input Layer, Additional Dense Layers, Convolutions, Flatten, Dropout, Concatenate, Dense Layers, Dropout
Tensor shapes: (None, 40, 30), (None, 40, 30, 8), Dense (0, 40)
Loss = binary crossentropy; customized loss function
Tensors!
Per receipt accuracy: 86%
Predict the amount – Not a good accuracy?
~1200 bills (one third are taxi bills)
Predicting the amount and the date of the receipt
Run Keras model on our receipt
Amount: 42.75
Amount probability: 82%
Use regex date patterns to extract the date
Date: 27/02/2018
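A minimal sketch of regex-based date extraction. The actual patterns used in the project are not given, so the pattern below is an assumption: day-first dates with `/`, `.` or `-` separators and two- or four-digit years. Matches that are not valid calendar dates are dropped.

```python
import re
from datetime import datetime

# Assumed pattern: day-first dates, separators / . or -, 2- or 4-digit year.
DATE_RE = re.compile(r"\b(\d{1,2})[/.-](\d{1,2})[/.-](\d{4}|\d{2})\b")

def extract_dates(text):
    """Return all valid calendar dates found in the text, in order of appearance."""
    dates = []
    for day, month, year in DATE_RE.findall(text):
        fmt = "%d/%m/%Y" if len(year) == 4 else "%d/%m/%y"
        try:
            dates.append(datetime.strptime(f"{day}/{month}/{year}", fmt).date())
        except ValueError:
            continue  # e.g. 31/02/2018 matches the pattern but is not a real date
    return dates

text = "TEL 06.69925682\nNR.0012\n27/02/18 14:11"
found = extract_dates(text)
print([d.strftime("%d/%m/%Y") for d in found])
```

The phone number and the receipt number on the sample lines do not match the pattern, so only the printed date is returned.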
Predict the amount and date – taxi receipts
Date: 25-01-2017
Amount: 20,00€
Find date box and amount box
Read date and amount from the boxes
Text to amount and date – taxi receipts
Xception model with its last layers changed to categorize the date box and the amount box
Box classes: Date, Amount
Boxes inside date box and amount box
Take the biggest box – discard smaller boxes contained over x% in another box
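The box-filtering rule can be sketched as below. Boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixels, and the default `x=0.8` is an arbitrary illustration of the "x%" on the slide, not a value from the talk.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; zero if the box is empty or inverted."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def contained_fraction(inner, outer):
    """Fraction of `inner`'s area that lies inside `outer`."""
    overlap = (
        max(inner[0], outer[0]), max(inner[1], outer[1]),
        min(inner[2], outer[2]), min(inner[3], outer[3]),
    )
    inner_area = area(inner)
    return area(overlap) / inner_area if inner_area else 0.0

def keep_biggest_boxes(boxes, x=0.8):
    """Drop every box that is more than `x` contained in an already kept, larger box."""
    kept = []
    for box in sorted(boxes, key=area, reverse=True):
        if all(contained_fraction(box, big) <= x for big in kept):
            kept.append(box)
    return kept

boxes = [
    (0, 0, 100, 40),   # large box
    (10, 5, 50, 35),   # fully inside the large box -> discarded
    (90, 0, 130, 40),  # only 25% inside the large box -> kept
]
print(keep_biggest_boxes(boxes))
```

Processing the boxes from largest to smallest means a small box is only ever compared against boxes at least as large as itself.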
Classify digits with the Xception model and apply rules for date and amount
Problem cases: cursive and broken digits, e.g. 2?-01-2017
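One way to picture the rule step: join the per-character predictions, mark low-confidence characters with `?`, then check the string against a date rule. Everything here is an assumption made for illustration: the `(character, probability)` format, the `min_prob` threshold, the treatment of separators as classified characters, and the restriction to years 20xx.

```python
import re

def digits_to_text(predictions, min_prob=0.6):
    """Join per-character predictions, marking low-confidence ones with '?'."""
    return "".join(ch if prob >= min_prob else "?" for ch, prob in predictions)

# Assumed rule: DD-MM-YYYY with a plausible day, month and a year in 20xx.
DATE_RULE = re.compile(r"^(0[1-9]|[12]\d|3[01])-(0[1-9]|1[0-2])-(20\d{2})$")

def check_date(text):
    """Return 'ok', 'uncertain' (unreadable characters) or 'invalid'."""
    if "?" in text:
        return "uncertain"
    return "ok" if DATE_RULE.match(text) else "invalid"

# Hypothetical classifier output for a handwritten date with one broken digit.
predictions = [
    ("2", 0.99), ("5", 0.35), ("-", 0.98), ("0", 0.97), ("1", 0.99),
    ("-", 0.95), ("2", 0.99), ("0", 0.98), ("1", 0.97), ("7", 0.96),
]
text = digits_to_text(predictions)
print(text, check_date(text))
```

A result flagged as uncertain or invalid could be routed to manual entry instead of being passed on automatically.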
Take-aways and next steps
1. Tuning Tesseract OCR takes a lot of effort
2. We need more data – new receipts on 30/9/2018
3. We need clean data – new receipts taken straight after payment
4. Retrain models
5. From amount to receipt level: Test probability thresholds – introduce meta models
6. Introduce other languages
7. Currency
8. Taxi receipts
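Testing probability thresholds (item 5) could look like the sketch below: for each threshold, measure how many receipts would be accepted automatically and how accurate the accepted ones are. The data, the `prob` and `correct` field names, and the idea of sending low-probability receipts to manual review are illustrative assumptions; meta models are not shown.

```python
def threshold_report(predictions, thresholds):
    """For each threshold: share of receipts auto-accepted and accuracy among those."""
    report = []
    for t in thresholds:
        accepted = [p for p in predictions if p["prob"] >= t]
        coverage = len(accepted) / len(predictions)
        accuracy = (
            sum(p["correct"] for p in accepted) / len(accepted) if accepted else None
        )
        report.append({"threshold": t, "coverage": coverage, "accuracy": accuracy})
    return report

# Made-up receipt-level predictions, for illustration only.
predictions = [
    {"prob": 0.95, "correct": True},
    {"prob": 0.82, "correct": True},
    {"prob": 0.70, "correct": False},
    {"prob": 0.55, "correct": True},
]
report = threshold_report(predictions, [0.5, 0.8])
for row in report:
    print(row)
```

Raising the threshold trades coverage for accuracy: fewer receipts are processed automatically, but those that are tend to be right more often.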
Thank You.
Contacts:
katarina.milosevic2@generali.com
AnalyticsSolutionCenter@generali.com

Ease the pain of Travel Refunds with AI - Katarina Milosevic
