Ease the pain of Travel Refunds with AI - Katarina Milosevic

Generali Group - Analytics Solutions Center
Belgrade, 18th September 2018

Context
When I say “travel expense”, what do you think about?

Challenge
8000 employees
600.000 hours of employee time
+ 600.000 hours of HR lost per year
…and a lot of frustration
Divide the process time by 10

RefundAnalytics
Solution: create an automatic travel expenses classifier integrated with a smartphone bot
Components: receipts from business trips, Telegram Bot, OCR, HR System

HR System
Category: Restaurant
Amount: 42.75
Date: 27/02/2018
Currency: Euro
*Categories: Restaurant, Transport (except taxi), Taxi, City tax, Mixed type (Hotel: City tax + Restaurant)

OCR process – Test available tools
                   Tesseract   Google API   Microsoft API
Amount extracted   85.2%       92.7%        64.7%
Date extracted     65.5%       86.3%        42.6%
Tesseract: pre-processing (psm, rotate, resize, denoising, removing lines, contrast)
Google: sending images to Google servers
5sec 1.8sec
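The slide does not say how the extraction rates were measured. As a minimal sketch of one plausible measure, the snippet below counts a receipt as a hit when its labelled amount appears among the numbers found in the OCR text. The function names and the sample strings are hypothetical, not taken from the project.

```python
import re


def normalize_amount(text: str) -> str:
    """Use a dot as decimal separator so '42,75' and '42.75' compare equal."""
    return text.replace(",", ".")


def amount_extracted(ocr_text: str, true_amount: str) -> bool:
    """True if the labelled amount appears among the decimal numbers in the OCR text."""
    candidates = re.findall(r"\d+[.,]\d{2}", ocr_text)
    return normalize_amount(true_amount) in {normalize_amount(c) for c in candidates}


def extraction_rate(samples) -> float:
    """Share of (ocr_text, true_amount) pairs where the amount was recovered."""
    hits = sum(amount_extracted(text, amount) for text, amount in samples)
    return hits / len(samples)


# Illustrative samples only (assumed, not project data).
samples = [
    ("TOTALE EURO 42,75\nCONTANTI 42,75", "42.75"),
    ("T0TALE EUR0 4Z,75", "42.75"),  # OCR misread a digit, so no match
    ("Amount: 20,00", "20.00"),
]
rate = extraction_rate(samples)
```

The same function can be run on the output of each OCR tool to compare them on an identical set of labelled receipts.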

From OCR to meaningful words
Flatten the boxes to line-by-line text
Replace misspellings and lemmatize
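The two steps above can be sketched as follows, assuming the OCR returns word boxes as (text, x, y, height) tuples. That box layout, the tolerance, the vocabulary and the cutoff are all assumptions made for illustration. Lemmatization needs a language-specific tool and is not shown.

```python
from difflib import get_close_matches


def boxes_to_lines(boxes, tolerance=0.5):
    """Group boxes with close vertical centres, then read each group left to right."""
    lines = []  # each entry: [centre y of the first box in the line, [boxes]]
    for box in sorted(boxes, key=lambda b: b[2] + b[3] / 2):
        text, x, y, h = box
        centre = y + h / 2
        if lines and abs(centre - lines[-1][0]) <= tolerance * h:
            lines[-1][1].append(box)
        else:
            lines.append([centre, [box]])
    return [
        " ".join(b[0] for b in sorted(group, key=lambda b: b[1]))
        for _, group in lines
    ]


# Hypothetical vocabulary of known receipt words.
VOCABULARY = ["totale", "contanti", "acqua", "minerale", "caffe"]


def fix_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace a word with its closest vocabulary entry, if one is close enough."""
    match = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return match[0] if match else word


boxes = [
    ("42,75", 200, 101, 20),
    ("TOTALE", 10, 100, 20),
    ("EURO", 110, 102, 20),
    ("CONTANTI", 10, 140, 20),
    ("42,75", 200, 141, 20),
]
text_lines = boxes_to_lines(boxes)
```

Here `fix_word("MINERALELL")` maps the OCR misspelling from the sample receipt to `minerale`, while numbers such as `42,75` are left unchanged because nothing in the vocabulary is close to them.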

Predicting the category of the receipt
Taxi
Transport (except taxi)
Restaurant
City tax
Mixed type (city tax + meal)

FONTANA SRL
VIA DEI CROCIFERI 12-13
00187 ROMA
P.IVA 08656441006
TEL 06.69925682
EURO
5 X 1,50
7,50
PANE E_O SERVIZI 2,50
ACQUA NATURALE
ACQUA MINERALELL 2,50
4 X 3,50
FIORI ZUCCA 14,00
4 X 11,00
TONNAR CAC_PEPE 44,00
RIGAT AMATRICIA 11,00
3 X 1,00
CAFFE CONV 3,00
CAFFE CONV 1,00
SUB-TOTALE 85,50
SCONTO -42,75
#TAV 75 – SHARON
TOTALE EURO 42.75 42,13
CONTANTI 42,75
NR.0012
27/02/18 14:11
MU1 72017007 PP 3,55
Use TFIDF features to build category model from text
TFIDF input -> Random Forest
Category: Restaurant
Category probability: 98%
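To make the TFIDF step concrete, here is a minimal pure-Python sketch of how receipt text can be turned into TFIDF weights. The weighting formula (term frequency times log(N / df) + 1) is one common variant chosen for illustration, and the three short texts are made up. In practice a library such as scikit-learn would compute the features (with a smoothed, normalized variant) and train the Random Forest on them.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * (log(N / df) + 1)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log(n_docs / doc_freq[term]) + 1)
            for term, count in counts.items()
        })
    return vectors


# Illustrative texts only (assumed, loosely based on words from the sample receipts).
receipts = [
    "acqua naturale caffe totale contanti",
    "taxi milano ricerca satellitare contante",
    "ristorante pizzeria caffe totale",
]
vectors = tfidf(receipts)
```

Words that appear in few receipts, such as `acqua`, get a higher weight than words shared across receipts, such as `caffe`, which is what lets a classifier pick up category-specific vocabulary.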

Let’s play a game.
Can you guess the category of this receipt based on the text extracted?

Which category?
Dal 1960 al vostro servizio
a ricerca satellitare
Milano PAGATA CON02.8585 a ricerca satellitare
Milano 8/3 67
Credit Card
Signature Firma OS. € From/to Globix cartaceo
Abbonamento Ranamento Dala Anali
RISTORANTE PIZZERIA CAFFE MIE
Esente IVA art. 10 N. 14 del 26.10.1972 N. 633
Contante

It’s a taxi receipt

Use TFIDF features and add image features from the Xception network to the model
FONTANA SRL
00187 ROMA
P.IVA 08656441006
TEL 06.69925682
EURO
5 X 1,50
7,50
ACQUA NATURALE
4 X 3,50
FIORI ZUCCA 14,00
4 X 11,00
3 X 1,00
CAFFE CONV 3,00
CAFFE CONV 1,00
SUB-TOTALE 85,50
SCONTO -42,75
#TAV 75 – SHARON
CONTANTI 42,75
NR.0012
27/02/18 14:11
MU1 72017007 PP 3,55
Xception
input
TFIDF
input

This was easy.
Model accuracy: 98%

Predict the amount – From text to amounts dataframe
            True amount         Amount rank     EOL before /     Is this amount a   x words around
            (target variable)   words / lines   after / around   decimal number?    the amount (embeddings)
Amount 1    No                  0.1             0                0                  …
Amount 2    Yes                 0.2             1                1                  …
…           …                   …               …                …                  …
FONTANA SRL
00187 ROMA
P.IVA 08656441006
TEL 06.69925682
EURO
5 X 1,50
7,50
ACQUA NATURALE
4 X 3,50
FIORI ZUCCA 14,00
4 X 11,00
3 X 1,00
CAFFE CONV 3,00
CAFFE CONV 1,00
SUB-TOTALE 85,50
SCONTO -42,75
#TAV 75 – SHARON
CONTANTI 42,75
NR.0012
27/02/18 14:11
MU1 72017007 PP 3,55
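A minimal sketch of how such an amounts table could be built from receipt text, with one row per number found. The regex, the feature names (`line_rank`, `is_decimal`, `eol_after`, `words_before`) and the labelling rule are assumptions for illustration, not the project's actual columns.

```python
import re

# Integers or numbers with one or two decimals, comma or dot as separator.
AMOUNT_RE = re.compile(r"-?\d+(?:[.,]\d{1,2})?")


def amount_candidates(text, window=2):
    """Return one feature dict per amount-like token in the text."""
    lines = text.splitlines()
    rows = []
    for line_no, line in enumerate(lines):
        words = line.split()
        for word_no, word in enumerate(words):
            if not AMOUNT_RE.fullmatch(word):
                continue
            rows.append({
                "amount": float(word.replace(",", ".")),
                # Relative position of the line in the receipt (0 = top, 1 = bottom).
                "line_rank": line_no / max(len(lines) - 1, 1),
                "is_decimal": int("," in word or "." in word),
                "eol_after": int(word_no == len(words) - 1),
                "words_before": words[max(word_no - window, 0):word_no],
            })
    return rows


def label(rows, true_total):
    """Add the target variable: 1 if the candidate equals the labelled total."""
    for row in rows:
        row["is_total"] = int(abs(row["amount"] - true_total) < 0.005)
    return rows


text = "5 X 1,50\nSUB-TOTALE 85,50\nSCONTO -42,75\nCONTANTI 42,75"
rows = label(amount_candidates(text), true_total=42.75)
```

On this excerpt of the sample receipt the function yields five candidates, and only the last one (`42.75`, preceded by `CONTANTI`) is labelled as the total.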

Predict the amount – Keras model on amounts dataframe
Word embeddings -> Input Layer -> Flatten -> Dropout
Other features -> Additional Input Layer
Concatenate -> Dense Layers -> Dropout
Loss = binary crossentropy (Total amount / Not Total amount)
Model accuracy: 94%
Per receipt accuracy: 86%
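The slide reports two different figures, and the sketch below shows one plausible reading of the difference: model accuracy counts every amount candidate separately, while per receipt accuracy asks whether the candidate with the highest probability on each receipt is the true total. How the figures were actually computed is not stated, and the predictions here are made up.

```python
# (receipt_id, predicted probability of "total amount", true label)
predictions = [
    ("r1", 0.10, 0), ("r1", 0.20, 0), ("r1", 0.90, 1),
    ("r2", 0.60, 0), ("r2", 0.55, 1), ("r2", 0.05, 0),
]


def candidate_accuracy(preds, threshold=0.5):
    """Share of individual candidates classified correctly."""
    correct = sum(int(p >= threshold) == y for _, p, y in preds)
    return correct / len(preds)


def receipt_accuracy(preds):
    """Share of receipts whose highest-probability candidate is the true total."""
    best = {}
    for rid, p, y in preds:
        if rid not in best or p > best[rid][0]:
            best[rid] = (p, y)
    return sum(y for _, y in best.values()) / len(best)


per_candidate = candidate_accuracy(predictions)
per_receipt = receipt_accuracy(predictions)
```

A single wrong candidate costs little at candidate level but loses the whole receipt, which is why the per receipt figure is the lower of the two.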

Predict the amount – Keras model on receipts dataframe
Inputs: Word embeddings (Input Layer), Other features (Additional Dense Layers)
Layers: Convolutions, Flatten, Dropout, Concatenate, Dense Layers, Dropout, Dense (0, 40)
Tensor shapes: (None, 40, 30), (None, 40, 30, 8)
Loss = binary crossentropy, customized loss function
Tensors!
Per receipt accuracy: 86%

Predict the amount – Not a good accuracy?
~1200 bills (one third are taxi bills)

Predicting the amount and the date of the receipt
Run Keras model on our receipt
Amount: 42.75
Amount probability: 82%
Use regex date patterns to extract the date
Date: 27/02/2018
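The regex step can be sketched like this. The pattern, the day-first order and the rule that two-digit years belong to the 2000s are assumptions, since the slide does not list the actual patterns used.

```python
import re
from datetime import date

# Day, month, year separated by '/', '.' or '-'; the year has 4 or 2 digits.
DATE_RE = re.compile(r"\b(\d{1,2})[/.-](\d{1,2})[/.-](\d{4}|\d{2})\b")


def extract_dates(text):
    """Return all valid day-first dates found in the text."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        year = int(year)
        if year < 100:
            year += 2000  # assumption: two-digit years are in the 2000s
        try:
            found.append(date(year, int(month), int(day)))
        except ValueError:
            continue  # e.g. month 13 or day 32: not a real date
    return found


text = "NR.0012\n27/02/18 14:11\nMU1 72017007 PP 3,55"
dates = extract_dates(text)
formatted = [d.strftime("%d/%m/%Y") for d in dates]
```

A pattern like this also matches dates that are not the receipt date, such as the legal reference `26.10.1972` printed on the taxi receipt shown earlier, so some rule for choosing among several matches is still needed.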

Predict the amount and date – taxi receipts
Date: 25-01-2017
Amount: 20,00€
Find date box and amount box
Read date and amount from the boxes

Text to amount and date – taxi receipts
Xception model with changed last layers to categorize date box and amount box
Date
Amount

Boxes inside date box and amount box

Take the biggest box – discard smaller boxes contained over x% in another box
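A minimal sketch of this filtering rule, assuming boxes are (x1, y1, x2, y2) tuples. The 80% threshold stands in for the unspecified x% and is only an example value.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; zero if the box is empty or inverted."""
    x1, y1, x2, y2 = box
    return max(x2 - x1, 0) * max(y2 - y1, 0)


def contained_fraction(inner, outer):
    """Share of the inner box's area that lies inside the outer box."""
    overlap = (
        max(inner[0], outer[0]), max(inner[1], outer[1]),
        min(inner[2], outer[2]), min(inner[3], outer[3]),
    )
    inner_area = area(inner)
    return area(overlap) / inner_area if inner_area else 0.0


def keep_biggest(boxes, threshold=0.8):
    """Visit boxes from largest to smallest; drop any box mostly inside a kept one."""
    kept = []
    for box in sorted(boxes, key=area, reverse=True):
        if all(contained_fraction(box, big) <= threshold for big in kept):
            kept.append(box)
    return kept


boxes = [(0, 0, 100, 50), (10, 10, 40, 40), (90, 0, 130, 50), (200, 0, 240, 50)]
kept = keep_biggest(boxes)
```

In this example the small box at (10, 10, 40, 40) lies entirely inside the largest box and is discarded, while the box that overlaps it only slightly at the edge is kept.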

Classify digits with xception model and apply rules for date and amount
Cursives and broken digits
2?-01-2017

Take-aways and next steps
1. Tuning Tesseract OCR takes a lot of effort
2. We need more data – new receipts on 30/9/2018
3. We need clean data – new receipts taken straight after payment
4. Retrain models
5. From amount to receipt level: test probability thresholds, introduce meta models
6. Introduce other languages
7. Currency
8. Taxi receipts
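Step 5 mentions testing probability thresholds. One way to do that, sketched here with made-up numbers, is to accept a receipt automatically only when its top amount probability reaches a threshold, and to compare coverage (share of receipts accepted) with accuracy among the accepted ones. The function name and the data are hypothetical.

```python
def threshold_report(results, thresholds):
    """results: one (top_probability, is_correct) pair per receipt."""
    report = []
    for t in thresholds:
        accepted = [ok for p, ok in results if p >= t]
        coverage = len(accepted) / len(results)
        # Accuracy is undefined when nothing is accepted at this threshold.
        accuracy = sum(accepted) / len(accepted) if accepted else None
        report.append((t, coverage, accuracy))
    return report


# Illustrative values only (assumed, not project results).
results = [(0.95, True), (0.82, True), (0.60, False), (0.55, True), (0.40, False)]
report = threshold_report(results, [0.5, 0.8])
```

Raising the threshold trades coverage for accuracy: receipts below it would go to manual review instead of straight to the HR system.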

Thank You.
Contacts:
katarina.milosevic2@generali.com
AnalyticsSolutionCenter@generali.com
