Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Luis Peinado Fuentes, BBVA Data & Analytics
Jose A. Rodriguez Serrano, BBVA Data & Analytics
Tagging Text in Money Transfe...
Our Goal:
Improving User Experience
in Retail Banking Products
with Machine Learning
(Using Spark)
#EUds7
Dashboard View of Expenses and Income
Anticipated Expenses
Customer Comparison Engine
#EUds7
All these examples require
categorized bank transactions
#EUds7
GAS COMPANY VAT 1234567X
Categorizing Bank Transactions
#EUds7
232 - 60.30 € 23/10/2015
...
BILL NO. A12345
Account operat...
The Problem: Categorizing Transfers
#EUds7
007 - 60.30 € 23/10/2015 ...
...
PAYMENT RENT OCTOBER FLAT 15
Account operation...
Data Science Challenges
• We did not know the data source in advance
• We did not have a labeled set
• A fraction of texts...
Basic Pipeline (in Pro since 2016)
#EUds7
GAS BILL/OCTOBER
Source text Sentence
classification
GAS
BILL
OCTOBER
Tokenizati...
Where Do We Get a Dataset From?
First Attempt:
Domain adaptation
Second Attempt:
“Assisted Dataset creation”
#EUds7
x y x’...
Exploring better embeddings/poolings: w2v + VLAD
Word2vec “Synonyms”
#EUds7
viaje à billete vuelo hotel reserva billetes g...
Vector of Locally Aggregated Descriptors (VLAD)
#EUds7
Word 1 Word 2 Word N…
…
Cluster k
1. Assign word code
to nearest cl...
Results of the Experiment
#EUds7
Method Recall @ Prec=98%
TF-IDF 21.0 %
Word2vec (avg pool) 24.5 %
Word2vec (avg pool)+
Am...
Enablers
#EUds7
7065%
3004%
1320%
741%
271%
LONDRES/%LONDON%
PARIS%
NEW%YORK%/%NUEVA%YORK%/%NEWYORK%
ROMA%
AMSTERDAM%
When...
Other Challenges
The system goes to the wild…
• Failure cases stay within
the tolerated 2%
Open system
• Word black-list n...
Conclusions
How Spark Helped Us
• Spark (+MLlib) was crucial in this use case.
• We complemented with own implementations ...
Luis Peinado Fuentes, BBVA Data & Analytics
Jose A. Rodriguez Serrano, BBVA Data & Analytics
Tagging Text in Money Transfe...
Appendix
(Just in case)
#EUds7
Exploring better embeddings/poolings: w2v + VLAD
Word2vec “Synonyms”
#EUds7
viaje à billete vuelo hotel reserva billetes g...
References
Scientific literature
• Jegou et al., Aggregating local descriptors into compact codes, IEEE Trans. on Pattern ...
Upcoming SlideShare
Loading in …5
×

Tagging Text in Money Transfers: A Use Case of Apache Spark in Production for Banking with Jose Roriguez Serrano

986 views

Published on

At BBVA (second biggest bank in Spain), every money transfer a customer makes goes through an engine that infers a category from its textual description. This engine has been developed in Spark, mixes MLLib and own implementations, and is currently into production serving more than 5M customers daily. In our proposed presentation, we plan to describe the process undergone by the Data Science team. This includes the problem (classify 700K daily transfers by its text), the data science challenges, the algorithmic and engineering solution, and the achievements and learnings. To make the talk practical and focused, we will walk through our best performing pipeline, which uses (i) word2vec embeddings to represent words, (ii) a pooling algorithm to aggregate them into sentence embeddings, (iii) a supervised classifier. We believe this is relevant because it mixes (i) off-the-shelf MLLib funcions, such as the word2vec embeddings, (ii) components that build on Mllib and were adapted (e.g. a calibrated multic-class logistic regression which outputs probabilities) and (iii) an own implementations of the pooling algorithm, which is known as the Vector of Locally Aggregated Descriptors (VLAD). We highlight that we are not aware of the previous application of word2vec with VLAD in NLP, so this would introduce a novelty. The relevant audience are data scientists, engineers, team leaders and executives interested in Spark and seeking examples of machine learning deployments in a real-world productive environment. The main takeaways will be: (i) it is relatively simple to build a text classification system in Spark, (ii) with extra effort one it is feasible to build state-of-the-art, productive solutions. We will mention the problems we found in practice, such as how to design a training corpus to maximize precision, not recall, or how we designed the system against “catastrophic” classification mistakes.

Published in: Data & Analytics
  • 12 Signs From The Universe When You Are On The... ♥♥♥ http://scamcb.com/manifmagic/pdf
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Order Manifestation Magic Today For Up To 96% Off The Retail Price. Offer Expires Soon. Over 100,000 Satisfied Customers. Join Today And See For Yourself ▲▲▲ https://bit.ly/30Ju5r6
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • 7 Sacred Signs from the Universe, learn more... ♣♣♣ http://scamcb.com/manifmagic/pdf
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • How You Can Go From Dead Broke To Abundantly Wealthy By Using One Simple Mind Hack... ♥♥♥ https://tinyurl.com/y44vwbuh
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Stop getting scammed by online, programs that don't even work! ■■■ http://ishbv.com/ezpayjobs/pdf
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Tagging Text in Money Transfers: A Use Case of Apache Spark in Production for Banking with Jose Roriguez Serrano

  1. 1. Luis Peinado Fuentes, BBVA Data & Analytics Jose A. Rodriguez Serrano, BBVA Data & Analytics Tagging Text in Money Transfers: A Use-Case of Spark in Banking #EUds7 @BBVAData
  2. 2. Our Goal: Improving User Experience in Retail Banking Products with Machine Learning (Using Spark) #EUds7
  3. 3. Dashboard View of Expenses and Income
  4. 4. Anticipated Expenses
  5. 5. Customer Comparison Engine #EUds7
  6. 6. All these examples require categorized bank transactions #EUds7
  7. 7. GAS COMPANY VAT 1234567X Categorizing Bank Transactions #EUds7 232 - 60.30 € 23/10/2015 ... BILL NO. A12345 Account operation Taxonomy rules (Hive / Spark) Identifier matching (Hive) Company name matching (Lucene) Text classification (Spark + MLlib) Other busines rules (Hive) 7 Million Daily Operations (Card and Bank account) A12345 Bill Transaction gas bill groceries public transport rent + About 60 categories health insurance
  8. 8. The Problem: Categorizing Transfers #EUds7 007 - 60.30 € 23/10/2015 ... ... PAYMENT RENT OCTOBER FLAT 15 Account operation Code = 007 Amount < 0 Category = Outgoing Transfer Why not… Category = Household Expenses? 700,000 Daily Operations (Transfers)
  9. 9. Data Science Challenges • We did not know the data source in advance • We did not have a labeled set • A fraction of texts is useless (“detection” rather than classification) • Distribution of categories is imbalanced • Prefer false negatives over false positives • Very short text, language not even syntactically correct #EUds7
  10. 10. Basic Pipeline (in Pro since 2016) #EUds7 GAS BILL/OCTOBER Source text Sentence classification GAS BILL OCTOBER Tokenization Word Encoding Sentence Pooling • First implementation: TF-IDF features + linear classifier (98% precision, 21% recall • Further tests with word2vec + Vector of Locally Aggregated Descriptors (VLAD) • Implemented in Spark/Scala, using MLlib classes • Own classes implemented for Multi-class Logistic Regression, VLAD • Scala dependency injection useful to quickly setup variants of the above steps …
  11. 11. Where Do We Get a Dataset From? First Attempt: Domain adaptation Second Attempt: “Assisted Dataset creation” #EUds7 x y x’ y’ x y y’ Non-Transfers Transfers But: p(x, y) ≠ p(x’, y’) x’ Transfers p(x, y) = p(x’, y’) Assisted annotation with tagging interface. Tags suggested by rules, incomplete training, similar-string matches, and confirmed manually. > 45K Transfers Tagged
  12. 12. Exploring better embeddings/poolings: w2v + VLAD Word2vec “Synonyms” #EUds7 viaje à billete vuelo hotel reserva billetes gastos boda regalo casa alquiler à piso cdad alq comunidad local garaje mes prop calle pago à factura fra fras fact 14 resto fac 50 facturas val word2vec = new Word2Vec() val model = word2vec.fit(input) val synonyms = model.findSynonyms(“alquiler”, 10)
  13. 13. Vector of Locally Aggregated Descriptors (VLAD) #EUds7 Word 1 Word 2 Word N… … Cluster k 1. Assign word code to nearest cluster 2. Compute residuals for each cluster … 3. Concatenate residuals Embedding Own implementation in Spark/Scala Jegou et al., Aggregating local descriptors into compact codes, IEEE TPAMI, 2012
  14. 14. Results of the Experiment #EUds7 Method Recall @ Prec=98% TF-IDF 21.0 % Word2vec (avg pool) 24.5 % Word2vec (avg pool)+ Amount 26.9 % Word2vec (VLAD pool) 37.3 % • The combination of w2v + VLAD is not in production • Currently working on own multi-word embedding which yields similar results P/R curve for a tag We wantto be here (high precision, fewer false positives).
  15. 15. Enablers #EUds7 7065% 3004% 1320% 741% 271% LONDRES/%LONDON% PARIS% NEW%YORK%/%NUEVA%YORK%/%NEWYORK% ROMA% AMSTERDAM% When do household expenses happen? Aggregated statistics about trip destinations Bconomy – Identifying Rental Expenses
  16. 16. Other Challenges The system goes to the wild… • Failure cases stay within the tolerated 2% Open system • Word black-list needed • Proposal for daily quality checks (plots: daily evolution of number of word-tag pairs) M. Zinkevich, Rules of ML: Best Practices for ML Engineering #EUds7
  17. 17. Conclusions How Spark Helped Us • Spark (+MLlib) was crucial in this use case. • We complemented with own implementations of Multi-Class LR and Vector of Locally aggregated descriptors. • We took advantage of Scala’s code injection mechanisms to frame the classification process as a set of standardized steps. Data Science Experience • Running text classifier in a real production system • Improving the standard components through experiments on word2vec and VLAD • Ongoing work: own embedding method #EUds7
  18. 18. Luis Peinado Fuentes, BBVA Data & Analytics Jose A. Rodriguez Serrano, BBVA Data & Analytics Tagging Text in Money Transfers: A Use-Case of Spark in Banking #EUds7@BBVAData Thanks to: • Roberto Maestre and Advisory Team @ BBVA D&A for insightful comments • Everyone at BBVA who made using Spark possible and disseminated Spark and Scala knowledge
  19. 19. Appendix (Just in case) #EUds7
  20. 20. Exploring better embeddings/poolings: w2v + VLAD Word2vec “Synonyms” #EUds7 viaje à billete vuelo hotel reserva billetes gastos boda regalo casa alquiler à piso cdad alq comunidad local garaje mes prop calle jose à antonio juan manuel francisco luis maria miguel carmen jesus lloguer à pis pagament quota comunitat jordi josep despeses parking joan 2014 à 2013 06 07 08 04 05 02 09 03 madrid à viaje hotel barcelona sevilla malaga la san sl reserva pago à factura fra fras fact 14 resto fac 50 facturas enero à febrero marzo abril mayo septiembre noviembre junio agosto val word2vec = new Word2Vec() val model = word2vec.fit(input) val synonyms = model.findSynonyms(“alquiler”, 10)
  21. 21. References Scientific literature • Jegou et al., Aggregating local descriptors into compact codes, IEEE Trans. on Pattern Analysis and Machine Intelligence, 2012. • Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features, ECML 1998. • Mikolov et al., Distributed representation of words and phrases and their compositionality, NIPS 2013. • Clinchant and Perronnin, Aggregating Continuous Word Embeddings for Information Retrieval, Workshop on Continuous Vector Space Models and Their Compositionality, 2013. Docs and posts • Feature Extraction and Transformation using the RDD API • Word Embeddings in 2017: Trends and Future Directions • M. Zinkevich, Rules of ML: Best Practices for ML Engineering #EUds7

×