Syoncloud Big Data for Retail Banking | Syoncloud
14/10/2013
Syoncloud Big Data for Retail Banking
Syoncloud offers a comprehensive Big Data / Data Science solution for retail banks.
We cover areas such as:
Individualization of product offers to existing clients
Early fraud detection and fraud damage mitigation
Prediction of product cancellations and client defections
Optimal allocation of cash to ATMs and bank branches
Minimization of the use of expensive bank channels such as branch visits
Reliable assessment of clients for debt products
Common Datasets
Common Datasets are used as a foundation for complex analysis.
Creation of Common Datasets for Analysis Related to Bank's Clients
We create a dataset of monthly expense and income categories for all clients, covering all their accounts and their complete history. This dataset is
created from bank account movements, direct debits and standing orders. Each account movement is usually accompanied by a movement type
code such as electricity, phone bill or restaurant. We also use the merchant's name, description and comment
fields to categorize each transaction. Direct debits and standing orders are accompanied by type codes as well.
We recognize several categories of expenses, such as housing (rent or mortgage), energy (gas and electricity), food and
household expenses, education (schools, books, courses), car expenses (fuel and repairs), restaurants, big-ticket items (TV, furniture),
taxes, recreation and hobbies, credit card and loan payments, luxury items and so on.
Income categories are salaries, dividends, tax refunds, social benefits, rental income, sales and so on. A simple regression analysis of this
dataset gives us overall trends for total expenses, incomes and savings, as well as detailed trends for each category of incomes and expenses for
each client.
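The per-category trend analysis described above can be sketched as an ordinary least-squares slope over monthly totals. The category names and amounts below are illustrative, not real bank data.

```python
# Sketch: least-squares trend slope per expense category.
# A positive slope means the category's monthly spend is growing.

def trend_slope(values):
    """Ordinary least-squares slope of values over time indices 0..n-1."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Monthly totals per category for one client (illustrative values).
monthly = {
    "housing": [900, 900, 910, 910, 920, 920],
    "restaurants": [120, 150, 170, 200, 230, 260],
}
trends = {cat: trend_slope(vals) for cat, vals in monthly.items()}
```

Here the "restaurants" slope comes out much steeper than "housing", which is exactly the kind of per-category signal the regression analysis surfaces.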
Machine Learning and Predictions
We use a full range of machine learning algorithms and models to make predictions. There are two broad categories: supervised and
unsupervised algorithms.
Supervised learning algorithms use historical data to learn that certain combinations and values of inputs cause certain outputs. We create
models that are trained and verified on samples of historical data. Sample data can be chosen randomly, but we have seen better results when we
categorize our datasets first. In the case of a customer dataset, we create categories such as age, income, location based on town size, education
and savings. Each category is split into brackets. For example, the age category is split into 20 five-year age brackets. We know how many
customers are in each age bracket, so we can sample a certain percentage of records from each bracket. We sample the other
categories the same way. These samples are ideal for seeing which category makes the largest contribution to the overall result. For example, we can see that education
makes the largest contribution to acceptance of a certain investment product.
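The bracketed sampling described above is stratified sampling. A minimal sketch, assuming records are dicts with an `age` field (the record layout and the 10% fraction are illustrative):

```python
import random
from collections import defaultdict

def age_bracket(age, width=5):
    """Map an age to its five-year bracket, e.g. 23 -> (20, 24)."""
    lo = (age // width) * width
    return (lo, lo + width - 1)

def stratified_sample(records, fraction, seed=42):
    """Sample the same fraction of records from every age bracket."""
    rng = random.Random(seed)
    brackets = defaultdict(list)
    for rec in records:
        brackets[age_bracket(rec["age"])].append(rec)
    sample = []
    for bracket_records in brackets.values():
        k = max(1, round(len(bracket_records) * fraction))
        sample.extend(rng.sample(bracket_records, k))
    return sample

# Illustrative customer records, ages spread uniformly over 20..59.
customers = [{"id": i, "age": 20 + (i % 40)} for i in range(1000)]
sample = stratified_sample(customers, fraction=0.1)
```

Because every bracket contributes the same fraction, no age group is over- or under-represented in the training sample.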
Unsupervised machine learning algorithms look for unknown patterns in available data.
For example, we find patterns of unusual client behaviour as early signs of fraud. In the past we were limited to statistical analysis of
behaviour that was common to all clients or to large groups of clients. With unsupervised learning models we can find patterns that surface
only in a small number of records.
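As a deliberately simplified stand-in for such per-client anomaly screening (the production system uses the clustering models described later, not this rule), one can flag transactions that deviate strongly from a client's own history using a robust, median-based z-score. The threshold and amounts are illustrative.

```python
import statistics

def robust_outliers(amounts, threshold=3.5):
    """Flag amounts whose modified z-score (median/MAD based) exceeds threshold."""
    med = statistics.median(amounts)
    mad = statistics.median(abs(a - med) for a in amounts)
    if mad == 0:
        return []
    return [a for a in amounts if 0.6745 * abs(a - med) / mad > threshold]

# A client's recent card transactions; the last one is unusually large.
history = [12.5, 40.0, 22.0, 18.0, 35.0, 27.0, 980.0]
suspicious = robust_outliers(history)
```

The point of the per-client baseline is that 980 is anomalous *for this client* even if it would be ordinary for the population as a whole.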
Individualization of Product Offers
By individualizing product offers to existing clients, banks save money on expensive broad marketing campaigns for bank products.
Products are offered only to customers who need them and are likely to accept them, and customers see fewer irrelevant offers. This
requires deep knowledge of who accepted given products in the past.
http://www.syoncloud.com/Syoncloud_Big_Data_for_Retail_Banking
As an input for our models we use a dataset of subscriptions to bank products and services for each client. This dataset includes previous
subscriptions and cancellation dates. We also use the common dataset of income and expense categories for each client and CRM data about
clients. We have created separate models for each product and subscription. In order to prepare suitable models we have to not only choose
and verify the best learning algorithm but also find which categories and variables have the biggest influence.
Early Fraud Detection and Fraud Damage Mitigation
This includes detection of identity frauds, credit card frauds, wire frauds, attacks on internet and mobile banking and money laundering.
New types of fraud and new schemes require flexible and fast detection algorithms. In the past, banks used only statistical and rule-based
algorithms to determine whether suspicious activity was taking place on a customer's account. These algorithms were limited: they can only recognize
known frauds, they require expensive maintenance, they do not work with the full history of each client and they produce a high level of false
positives.
We utilized a dataset of known fraud cases. We have created several categories of these frauds, such as overdraft fraud with a stolen identity,
stolen credit cards, consumer loan fraud, credit card top-up with a fraudulent check, stolen checks, skimming with card duplication, attacks on
online banking with stolen customer credentials and/or security devices, rogue online merchant frauds using credit cards and so on. We use
neural networks with backpropagation, decision tree algorithms and classification to find patterns and unknown occurrences of these
frauds in our existing data.
Prediction of Product Cancellations and Client Defections
Prediction of bank product cancellations and client defections is very time sensitive. A bank has just days to act before a client irreversibly
decides to cancel a product or move to the competition. The bank needs to identify clients who are likely to defect, contact them and proactively
offer alternative products or solve the clients' issues. It is much cheaper to retain highly profitable clients than to attract them back.
We have used account movements, debit and credit card movements, the client dataset from CRM, the product subscription dataset, call centre
and branch visit transactions and log information as primary data sources for our analysis. We have also utilized the common datasets of
incomes and expenses.
We have prepared time series of key events such as direct debit cancellations, incoming salaries, dividends and rents,
transfers to the client's accounts at different banks, call centre and branch contacts made by the client separated into categories, cancellations of
credit cards and so on.
We have prepared another set of clients who match the same categories, such as age, income, savings and location, for the same time interval but
who still remain clients. We have prepared matching time series for these clients as well.
Based on this data we were able to create models that predict the behaviour of clients before they irreversibly decide to move to
competitors. We have used several supervised learning algorithms such as Support Vector Machines for binary classification and Neural
Networks with Backpropagation for predictions. From unsupervised machine learning algorithms we have utilized K-Means and Mean Shift
Clustering, after Principal Component Analysis was applied to reduce the dimensionality of the input data.
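A toy one-dimensional K-Means (k = 2) illustrates the clustering step. A production model clusters many PCA-reduced features; this sketch uses a single made-up feature, an "outflow ratio" of money leaving the client's accounts, purely for clarity.

```python
def kmeans_1d(values, iterations=20):
    """Split values into two clusters around two alternating-refined centroids."""
    c1, c2 = min(values), max(values)
    g1, g2 = [], []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        if not g1 or not g2:
            break
        # Update step: centroids move to their cluster means.
        c1 = sum(g1) / len(g1)
        c2 = sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

# Illustrative outflow ratios: low values look loyal, high values look like defectors.
stable, at_risk = kmeans_1d([0.1, 0.15, 0.2, 0.12, 0.8, 0.9, 0.85])
```

The clients in the high-outflow cluster are the candidates to match against known defection patterns.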
We have identified several hundred profitable clients in recent data who match the patterns of clients who moved their accounts to
competitors. These clients should be contacted by their respective bank branches.
Optimal Allocation of Cash for ATMs and Bank Branches
Demand for cash at many ATM and bank branch locations is highly variable during the year. The variability is caused by weather, local events,
vacations, tourism and so on. It is important to predict the right amount of cash that needs to be deposited into ATMs as well as bank branches. It
is costly to service ATMs too often, and it is also costly to have cash machines out of order due to lack of cash. At the same time, we want to limit the
amount of unnecessary cash that is stored for long periods in ATMs and bank branches, since excess cash is allocated suboptimally and
attracts crime.
As the primary datasets we have used ATM service logs, geographic locations of ATMs and bank branches, the withdrawal dataset for each ATM,
weather reports for ATM and bank branch locations, and schedules of sports, cultural and other events as well as holidays for all locations. We
have utilized credit and debit card movements to assess demand for cash at various locations and during different times of the year. We
have used the common dataset of incomes to see when salaries, social benefits and other incomes arrived in clients' accounts at different
locations.
We have created a dataset of median cash withdrawal amounts for each day of the year and each hour of the day for all ATMs. This dataset is used to
calculate the influence of weather, events, day of the week or holidays on the demand for cash at a given location.
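Building that baseline is a group-by-and-median over withdrawal records. A minimal sketch, assuming records are `(atm_id, day_of_year, hour, amount)` tuples (the layout and values are illustrative):

```python
import statistics
from collections import defaultdict

# Illustrative withdrawal records for one ATM on day 100 of the year.
withdrawals = [
    ("atm1", 100, 9, 50), ("atm1", 100, 9, 70), ("atm1", 100, 9, 200),
    ("atm1", 100, 17, 120), ("atm1", 100, 17, 80),
]

# Group amounts by (ATM, day of year, hour) slot.
slots = defaultdict(list)
for atm, day, hour, amount in withdrawals:
    slots[(atm, day, hour)].append(amount)

# Median per slot; the median is robust to one-off large withdrawals.
medians = {slot: statistics.median(amounts) for slot, amounts in slots.items()}
```

Using the median rather than the mean keeps a single unusually large withdrawal (like the 200 above) from distorting the baseline for that slot.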
We have prepared a dataset of significant cultural, sports and other events during the past 4 years, with location coordinates. We have calculated the
influence of each event on cash demand for all ATMs within a 300 m radius of the event, and we were able to sort all events by their
influence on cash demand. This dataset is used to predict the influence of similar events.
We have also calculated the correlation between local weather parameters, such as precipitation, temperature and wind, at the location of each ATM
and the cash demand there.
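Such a correlation can be computed as a Pearson coefficient between paired daily series. The rainfall and demand figures below are illustrative, not real measurements.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Daily precipitation (mm) vs. withdrawals at one ATM: heavy rain, fewer visits.
rain = [0, 2, 5, 10, 20]
demand = [200, 180, 150, 120, 90]
r = pearson(rain, demand)
```

A strongly negative `r` for an ATM says rainy-day demand there drops predictably, which feeds directly into the cash allocation model.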
We have created a correlation dataset between the days when clients receive incomes, such as salaries and social benefits, and cash demand at
different locations.
We have prepared models that can predict cash demand for each day of the year for each ATM and bank branch location. This model takes
into account results from historical datasets as well as weather forecast data and schedules of events. We have utilized algorithms such as Restricted
Boltzmann Machines, the Perceptron and Gaussian Discriminant Analysis.
Minimize Use of Expensive Channels
We can minimize the use of expensive bank channels such as over-the-counter operations and other bank branch visits as well as calls
to call centres.
This can be achieved by optimizing the online banking and mobile banking applications, help pages and wizards, as well as by optimizing
pages on the bank's websites. Another way to encourage reluctant clients to switch to cheaper channels is through targeted campaigns.
Our primary sources of data for this analysis were web log files from the online banking and mobile banking applications. We have
also used bank account movements with bank channel codes, the dataset of call centre transactions, the CRM dataset with information about
customers and the dataset of transactions from bank branches.
An important dataset was complaints and enquiries from the call centre, emails, letters and branches. We have sorted this dataset by area of
interest and correlated it with help web pages. We were able to identify help pages that were unclear and caused confusion and
unnecessary calls to the call centre. We have also identified several operations in online banking that were complex and generated a higher
number of complaints. We have uncovered several areas related to exchange rates during credit card payments that were not covered by
help pages but were often discussed over the phone or even during bank branch visits. Changes made to product-related web pages, self-help
pages, search optimization, online banking operations and mobile banking applications can bring quick savings on outsourced call centres
and bank branch visits.
We have analysed the results of marketing campaigns to move reluctant clients to online and mobile banking or self-service kiosks. Using
correlation analysis, we have seen that broad marketing campaigns were not efficient. We have analysed the patterns of bank clients
who recently moved most of their operations online. This gave us a tool to select the portion of clients who are more likely to move online. These
customers should be targeted by personalized marketing campaigns or by demonstrations of the advantages at bank branches.
Assessment of Clients for Debt Products
In order to reliably assess risks and approve debt products for existing clients, we need to take into account not just current credit scores and the
current disposable income of the clients but also the complete history of the client as well as the social context. This decreases risk for the bank and
increases income from valuable clients who would otherwise be rejected.
As a primary source of data we have used the common dataset of incomes and expenses, the complete repayment history for credit cards,
consumer loans, mortgages, overdrafts and other debt products, and CRM information about clients.
We have used a Markov chain stochastic process to model clients' debt and repayment behaviour. This model was tested on
historical data of profitable and defaulted loans, credit cards and other debt products. We have noticed improved reliability of credit scores
and we were able to suggest suitable alternative debt products for rejected clients.
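The core of such a model is a transition matrix estimated from observed monthly payment states. A minimal sketch, where the states (`"ok"`, `"late"`, `"default"`) and the client histories are illustrative:

```python
from collections import Counter, defaultdict

def transition_matrix(sequences):
    """Estimate P(next_state | state) from observed state sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    # Normalize each row of counts into probabilities.
    return {
        state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for state, nexts in counts.items()
    }

# Monthly payment states for three illustrative clients.
histories = [
    ["ok", "ok", "late", "ok"],
    ["ok", "late", "late", "default"],
    ["ok", "ok", "ok", "ok"],
]
matrix = transition_matrix(histories)
```

A client currently in the `"late"` state can then be scored by the estimated probability of moving to `"default"` next month.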
Overview of Primary Datasets and Sizing Example
These are examples of primary datasets and sizing calculations. Each project is specific and not all datasets are available, but the data
sizing calculations are likely to be similar.
Account movements for all active and former clients. This dataset includes the complete history of account movements for all current and
savings accounts. It contains 6 million unique clients and 23 million active and closed accounts. The average size of movements
per account is 1 MB, which gives us 23 TB of uncompressed, de-normalized CSV files.
The dataset of debit and credit card movements contains 25 million unique card IDs, with an average of 3 thousand transactions per
card. The total number of records is 75 billion. Each record in uncompressed CSV form is 1 kB, so the total size of this dataset is
75 TB.
Technical log files from the internet and mobile banking applications amount to 50 TB. These files include front-end Apache log files as well
as application logs.
Bank transactions, requests for help and complaints from the call centre. This dataset contains bank transactions, requests for help
and complaints from 1 million unique customers. The average number of interactions per customer is 35 and the typical size of an interaction is
10 kB, so the total size of the dataset is 350 GB.
CRM information about clients with historical values includes personal information about customers such as employment, education,
age and family status. The dataset includes current and historical information for about 6 million clients with a typical size of 100 kB per client, for a total
of 600 GB.
Direct debits and standing orders of bank clients with historical values. The typical number of standing orders and direct debits
per client, including historical values, is 50, and the size of a single record is 1 kB. The total size of the dataset for 6 million clients with 50 records per client is
300 GB.
Product subscription data for all clients with complete history. A typical number of current and historical subscriptions per
client is 12. This includes accounts, mortgages, loans, credit cards and other bank products. 6 million clients multiplied by 12
subscriptions per client and by 1 kB per subscription is 72 GB.
Customer data from branch visits. This dataset includes over-the-counter bank transactions, help requests, product subscriptions and
cancellations, and complaints. The typical number of interactions per client is 10, though there are large differences in branch service utilization
among clients. 3 million clients, 10 interactions each and 10 kB per interaction means 300 GB.
Dataset of debtors and dataset of failed applications for debt products. The total size of 1 million records in these datasets is 1 GB.
Help file usage from mobile and internet banking. 6 million users multiplied by an average of 1000 clicks on help files
and by an average record size of 1 kB is 6 TB.
The total size of all primary datasets is 156 TB, calculated as a simple sum: 75 TB + 50 TB + 23 TB + 6 TB + 600 GB
+ 350 GB + 300 GB + 300 GB + 72 GB + 1 GB ≈ 156 TB. We can reduce the overall size by using compression and by removing technical fields
that carry no business meaning from the datasets. Log files are also reduced by removing lines with no business meaning.
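The sum above can be cross-checked directly (sizes in GB, using 1 TB = 1000 GB as in the figures above):

```python
# Cross-check of the sizing arithmetic for the primary datasets.
dataset_sizes_gb = {
    "card movements": 75_000,
    "internet/mobile banking logs": 50_000,
    "account movements": 23_000,
    "help file usage": 6_000,
    "CRM": 600,
    "call centre": 350,
    "direct debits and standing orders": 300,
    "branch visits": 300,
    "product subscriptions": 72,
    "debtors and failed applications": 1,
}
total_tb = sum(dataset_sizes_gb.values()) / 1000
print(round(total_tb))  # prints 156
```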
Implementation Steps
Isolation of sensitive data from Big Data analytics
In order to isolate Big Data analytics from sensitive data, we remove clients' names, addresses, telephone numbers and emails during the data
export processes.
The next step is to create a process that replaces real credit and debit card numbers, account numbers and customer IDs with randomly
generated numbers. These randomly generated numbers must be identical for the same entity across different datasets to enable analytics.
The process stores the pairs of real and randomly generated numbers in tables kept in a separate, secure
relational database that is continuously updated. This database is also used to match randomly generated numbers back to real numbers after
the Big Data analyses are performed. This isolates data scientists and administrators from sensitive information, which remains
accessible only to authorized bank employees.
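The pseudonymization step can be sketched as a token vault. Here the mapping lives in an in-memory dict purely for illustration; in the process described above it resides in a separate, access-controlled relational database.

```python
import secrets

class TokenVault:
    """Replaces real identifiers with random tokens, consistently across datasets."""

    def __init__(self):
        self._real_to_token = {}
        self._token_to_real = {}

    def tokenize(self, real_id):
        """Return a stable random token for real_id; same entity, same token."""
        if real_id not in self._real_to_token:
            token = secrets.token_hex(8)
            self._real_to_token[real_id] = token
            self._token_to_real[token] = real_id
        return self._real_to_token[real_id]

    def resolve(self, token):
        """Map a token back to the real identifier (authorized staff only)."""
        return self._token_to_real[token]

vault = TokenVault()
t1 = vault.tokenize("4111-1111-1111-1111")  # illustrative card number
t2 = vault.tokenize("4111-1111-1111-1111")  # same entity yields the same token
```

Because tokens are random rather than derived from the identifier, the analytics side cannot reverse them without access to the vault.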
Extraction, Transformation and Loading of Primary Datasets
We perform an initial ETL (Extraction, Transformation and Loading) of data and run continuous processes of daily or hourly updates and imports
of recent data from the bank's production systems.
The initial extraction was performed by the bank's production and backup systems. Data was extracted in de-normalized text form in CSV or fixed-length
field formats. This form is ideal for bulk uploads into Big Data systems: the de-normalized form uses concrete values instead of
reference IDs as in relational databases.
Continuous data exports are channeled via JMS, MQ Series, CSV files and Sqoop. Exported data is picked up by Big Data scripts written in
Pig or Hive. These scripts are triggered via Oozie processes.
Transformation of Input Data
Transformation rules and scripts are shared by the initial and continuous ETL processes. We have used Pig and Hive scripts and
UDFs (User Defined Functions) written in Java to perform the transformation steps. Oozie workflows were used to chain the transformation steps.
We have used several practical rules for data transformations:
Different file formats are separated into their own directories inside HDFS (Hadoop Distributed File System).
Unprocessed and failed records are written into specific directories for manual investigation.
Intermediate result files are deleted only after all transformation steps have completed successfully. This saves HDFS space while still making it
possible to investigate and re-run incomplete transformations.
Pig and Hive scripts are kept simple and single-purpose. This enables easy debugging and re-use.
Java UDFs are only used if a given function is not available in the standard library or in the PiggyBank library.
Transformation scripts are reused for processing updates.
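The routing rule for unprocessed and failed records can be sketched in Python (the actual pipeline uses Pig/Hive as stated above; the column names and validation rule here are illustrative):

```python
import csv
import io

def transform(rows):
    """Validate rows; good ones go to 'processed', bad ones to 'failed' for review."""
    processed, failed = [], []
    for row in rows:
        try:
            amount = float(row["amount"])
            processed.append({"account": row["account"], "amount": amount})
        except (KeyError, ValueError):
            # In the pipeline these rows land in a dedicated HDFS directory
            # for manual investigation; here we just collect them.
            failed.append(row)
    return processed, failed

# Illustrative CSV export with one malformed record.
raw = csv.DictReader(io.StringIO(
    "account,amount\nA1,100.0\nA2,not-a-number\nA3,35.5\n"
))
processed, failed = transform(raw)
```

Keeping failed records in their own output, instead of silently dropping them, is what makes the manual investigation step in the list above possible.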