The talk briefly describes a business problem I believe we all share: having to fill in expense reports to get travel costs refunded by our company. It then presents the AI solution we have developed at Generali Group.
We set up a Telegram bot and asked people to send their receipts as smartphone photos instead of scanning them and entering them in SAP. Using this data, we developed data science algorithms, as opposed to traditional computer vision alone: deep learning models that use the words present in the receipt, plus other features we engineered, to identify the category of the receipt (TF-IDF + image analytics for the categorization model) and to retrieve the total amount and date of the bill (Keras model with word embeddings + features engineered from the receipt). The project represents substantial financial savings for Generali Head Office and Generali Italy and, even more important, improves employee satisfaction.
Moreover, the project won a place at the Milan Digital Week Hackfest at Microsoft Italy for introducing handwriting recognition, and we are now designing the solution for production. The talk will be structured in TEDx style, since I have received training from the team that trains TEDx speakers in Italy, with steps that help an audience with different levels of ML knowledge understand the project and its impact.
5. Solution - Create an automatic travel expenses classifier integrated with a smartphone bot
Receipts from business trips
Telegram Bot
OCR
RefundAnalytics
HR System
7. Solution - Create an automatic travel expenses classifier integrated with a smartphone bot
HR System
*Categories: Restaurant, Transport (except taxi), Taxi, City tax, Mixed type (Hotel: City tax + Restaurant)
Category: Restaurant
Amount: 42.75
Date: 27/02/2018
Currency: Euro
8. Solution - Create an automatic travel expenses classifier integrated with a smartphone bot
Receipts from business trips
Telegram Bot
OCR
RefundAnalytics
HR System
9. OCR process – Test available tools
Metric | Tesseract | Google API | Microsoft API
Amount extracted | 85.2% | 92.7% | 64.7%
Date extracted | 65.5% | 86.3% | 42.6%
Tesseract: Pre-processing – psm, rotate, resize, denoising, removing lines, contrast
Google: Sending images to Google servers
5sec 1.8sec
10. From OCR to meaningful words
Flatten the boxes to line-by-line text
Replace misspellings and lemmatize
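A minimal sketch of the flattening step, assuming each OCR word box arrives as a (text, x, y, height) tuple with y as its top edge. The box format, the `flatten_boxes` name and the half-height tolerance are assumptions for illustration, not the implementation used in the project.

```python
from typing import List, Tuple

# Hypothetical box format: (text, x_left, y_top, height) in pixels.
Box = Tuple[str, float, float, float]


def flatten_boxes(boxes: List[Box], tolerance: float = 0.5) -> List[str]:
    """Group OCR word boxes into text lines, top to bottom, left to right."""
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[2]):  # sort by vertical position
        _, _, y, h = box
        # Same line if the top edge is within `tolerance` box heights of the
        # first box already placed on the current line.
        if lines and abs(y - lines[-1][0][2]) <= tolerance * h:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Within each line, order the words by horizontal position.
    return [
        " ".join(b[0] for b in sorted(line, key=lambda b: b[1]))
        for line in lines
    ]


boxes = [
    ("42,75", 200, 101, 20),
    ("TOTALE", 10, 100, 20),
    ("EURO", 90, 102, 20),
    ("CONTANTI", 10, 130, 20),
]
flat = flatten_boxes(boxes)
```

Spelling correction and lemmatization would then run on each flattened line; which tools were used for that is not stated on the slide.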
11. Predicting the category of the receipt
Taxi
Transport (except taxi)
Restaurant
City tax
Mixed type (city tax + meal)
12. Predicting the category of the receipt
FONTANA SRL
VIA DEI CROCIFERI 12-13
00187 ROMA
P.IVA 08656441006
TEL 06.69925682
EURO
5 X 1,50
7,50
PANE E_O SERVIZI 2,50
ACQUA NATURALE
ACQUA MINERALELL 2,50
4 X 3,50
FIORI ZUCCA 14,00
4 X 11,00
TONNAR CAC_PEPE 44,00
RIGAT AMATRICIA 11,00
3 X 1,00
CAFFE CONV 3,00
CAFFE CONV 1,00
SUB-TOTALE 85,50
SCONTO -42,75
#TAV 75 – SHARON
TOTALE EURO 42.75 42,13
CONTANTI 42,75
NR.0012
27/02/18 14:11
MU1 72017007 PP 3,55
Use TFIDF features to build category model from text
TFIDF input -> Random Forest
Category: Restaurant
Category probability: 98%
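To illustrate what the TFIDF input contains, here is a small plain-Python sketch of the weighting, assuming the common tf * log(N / df) variant. The exact variant and library used in the project are not stated; in practice a library vectorizer would typically replace this, and the resulting vectors would be passed to the Random Forest.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy token lists standing in for cleaned receipt text (illustrative only).
receipts = [
    ["totale", "euro", "caffe", "acqua"],
    ["totale", "euro", "taxi", "milano"],
    ["totale", "euro", "caffe", "pane"],
]
w = tfidf(receipts)
```

Words that appear on every receipt, such as "totale", get weight zero, while words specific to one kind of receipt, such as "taxi", get the highest weights. That is why these features help separate categories.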
13. Predicting the category of the receipt
Let’s play a game.
Can you guess the category of this receipt based on the text extracted?
14. Predicting the category of the receipt
Which category?
Dal 1960 al vostro servizio
a ricerca satellitare
Milano PAGATA CON02.8585 a ricerca satellitare
Milano 8/3 67
Credit Card
Signature Firma OS. € From/to Globix cartaceo
Abbonamento Ranamento Dala Anali
RISTORANTE PIZZERIA CAFFE MIE
Esente IVA art. 10 N. 14 del 26.10.1972 N. 633
Contante
16. Predicting the category of the receipt
Use TFIDF features and add image features from the Xception network to the model
Xception input
TFIDF input
19. Predict the amount – From text to amounts dataframe
Candidate | True amount (target variable) | Amount rank (words / lines) | EOL before / after / around | Is this amount a decimal number? | x words around the amount (embeddings)
Amount 1 | No | 0.1 | 0 | 0 | …
Amount 2 | Yes | 0.2 | 1 | 1 | …
… | … | … | … | … | …
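A rough sketch of how receipt text could be turned into one row per candidate amount. The feature names (`line_rank`, `is_decimal`, `eol_after`, `context`) and their exact definitions are assumptions loosely based on the column headers of the amounts dataframe, not the project's actual feature set.

```python
import re
from typing import Dict, List

# Matches numbers such as 42,75 / 42.75 / 3,00 / 14 (optionally negative).
NUMBER = re.compile(r"-?\d+(?:[.,]\d{1,2})?")


def amount_candidates(lines: List[str], window: int = 2) -> List[Dict]:
    """Build one feature row per numeric token found in the receipt text."""
    rows = []
    for i, line in enumerate(lines):
        tokens = line.split()
        for j, tok in enumerate(tokens):
            if not NUMBER.fullmatch(tok):
                continue
            rows.append({
                "amount": float(tok.replace(",", ".")),
                # Relative vertical position of the line on the receipt.
                "line_rank": round((i + 1) / len(lines), 2),
                "is_decimal": int("," in tok or "." in tok),
                # 1 if the amount is the last token on its line.
                "eol_after": int(j == len(tokens) - 1),
                # Up to `window` tokens on each side, later fed to embeddings.
                "context": tokens[max(0, j - window):j]
                + tokens[j + 1:j + 1 + window],
            })
    return rows


lines = [
    "SUB-TOTALE 85,50",
    "SCONTO -42,75",
    "TOTALE EURO 42.75 42,13",
    "CONTANTI 42,75",
]
rows = amount_candidates(lines)
```

Each row would then be labelled as total amount or not, which gives the binary target variable for the model.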
20. Predict the amount – Keras model on amounts dataframe
Inputs: Word embeddings (Input Layer), Other features (Additional Input Layer)
Layers: Flatten, Dropout, Concatenate, Dense Layers, Dropout
Loss = binary crossentropy (Total amount / Not Total amount)
Model accuracy: 94%
Per receipt accuracy: 86%
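The two accuracy figures differ because they are plausibly measured at different levels: one per candidate amount, one per receipt. The sketch below shows that distinction under this assumption. The row format, the 0.5 threshold and the rule of picking the highest-probability candidate per receipt are illustrative choices, not confirmed details of the project.

```python
from collections import defaultdict
from typing import List, Tuple

# Hypothetical rows: (receipt_id, predicted probability, is_true_total)
Row = Tuple[str, float, bool]


def candidate_accuracy(rows: List[Row], threshold: float = 0.5) -> float:
    """Share of candidate amounts classified correctly at the threshold."""
    return sum((p >= threshold) == y for _, p, y in rows) / len(rows)


def receipt_accuracy(rows: List[Row]) -> float:
    """Share of receipts whose top-probability candidate is the true total."""
    by_receipt = defaultdict(list)
    for rid, p, y in rows:
        by_receipt[rid].append((p, y))
    hits = sum(
        max(cands, key=lambda c: c[0])[1] for cands in by_receipt.values()
    )
    return hits / len(by_receipt)


rows = [
    ("r1", 0.10, False), ("r1", 0.82, True), ("r1", 0.20, False),
    ("r2", 0.40, False), ("r2", 0.30, True), ("r2", 0.05, False),
]
cand_acc = candidate_accuracy(rows)
rec_acc = receipt_accuracy(rows)
```

In this toy data five of six candidates are classified correctly, but only one of two receipts gets the right total, since a single wrong candidate is enough to spoil a whole receipt.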
21. Predict the amount – Keras model on receipts dataframe
Inputs: Word embeddings (Input Layer), Other features (Additional Dense Layers)
Layers: Flatten, Dropout, Concatenate, Dense Layers, Dropout
Loss = binary crossentropy
Tensor shapes: (None, 40, 30), (None, 40, 30, 8)
Convolutions
Customized loss function
Dense (0, 40)
Tensors!
Per receipt accuracy: 86%
22. Predict the amount – Not a good accuracy?
~1200 bills (one third are taxi bills)
23. Predicting the amount and the date of the receipt
Run Keras model on our receipt
Amount: 42.75
Amount probability: 82%
Use regex date patterns to extract the date
Date: 27/02/2018
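A minimal sketch of regex-based date extraction, assuming day-first dates with `/`, `-` or `.` as separators and two-digit years in the 2000s. The actual patterns used in the project are not shown on the slide, so this pattern and the `extract_date` helper are illustrative.

```python
import re
from datetime import date
from typing import Optional

# dd/mm/yy or dd/mm/yyyy with /, - or . as separator (day-first).
DATE = re.compile(r"\b(\d{1,2})[/.\-](\d{1,2})[/.\-](\d{2}|\d{4})\b")


def extract_date(text: str) -> Optional[date]:
    """Return the first valid day-first date found in the text, if any."""
    for day, month, year in DATE.findall(text):
        # Assumption: two-digit years belong to the 2000s.
        y = int(year) + 2000 if len(year) == 2 else int(year)
        try:
            return date(y, int(month), int(day))
        except ValueError:
            continue  # looks like a date but is not one, e.g. 31/02/18
    return None


found = extract_date("NR.0012\n27/02/18 14:11")
```

Validating each match with `datetime.date` filters out digit groups that only look like dates. Other dates printed on a receipt, such as the date of a cited law, would still match, so a real pipeline needs extra rules to choose among several hits.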
24. Predict the amount and date – taxi receipts
Date: 25-01-2017
Amount: 20,00€
Find date box and amount box
Read date and amount from the boxes
25. Text to amount and date – taxi receipts
Xception model with changed last layers to categorize date box and amount box
Date
Amount
26. Text to amount and date – taxi receipts
Boxes inside date box and amount box
27. Text to amount and date – taxi receipts
Take the biggest box – discard smaller boxes contained over x% in another box
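The filtering rule can be sketched as follows, assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max). The slide leaves the threshold x unspecified, so the 0.8 default here is only a placeholder, and `drop_nested` is a hypothetical name.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def area(b: Box) -> float:
    """Area of a box; zero if the box is empty or inverted."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def contained_fraction(inner: Box, outer: Box) -> float:
    """Fraction of `inner`'s area that lies inside `outer`."""
    overlap = (
        max(inner[0], outer[0]), max(inner[1], outer[1]),
        min(inner[2], outer[2]), min(inner[3], outer[3]),
    )
    return area(overlap) / area(inner) if area(inner) > 0 else 0.0


def drop_nested(boxes: List[Box], x: float = 0.8) -> List[Box]:
    """Keep larger boxes; discard any box contained over `x` in a kept box."""
    kept: List[Box] = []
    for box in sorted(boxes, key=area, reverse=True):  # biggest first
        if all(contained_fraction(box, k) <= x for k in kept):
            kept.append(box)
    return kept


boxes = [(0, 0, 100, 40), (10, 5, 30, 35), (90, 0, 130, 40), (200, 0, 220, 40)]
result = drop_nested(boxes)
```

Processing boxes from largest to smallest means a small box is only ever compared with boxes at least as large as itself, so partially overlapping neighbours survive while fully nested fragments are dropped.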
29. Text to amount and date – taxi receipts
Classify digits with the Xception model and apply rules for date and amount
Cursives and broken digits
2?-01-2017
30. Take-aways and next steps
1. Tuning Tesseract OCR takes a lot of effort
2. We need more data – new receipts on 30/9/2018
3. We need clean data – new receipts taken straight after payment
4. Retrain models
5. From amount to receipt level: Test probability thresholds – introduce meta models
6. Introduce other languages
7. Currency
8. Taxi receipts