
Syoncloud Big Data for Retail Banking



Published in: Economy & Finance, Business


Syoncloud Big Data for Retail Banking | Syoncloud, 14/10/2013

Syoncloud offers a comprehensive Big Data and data science solution for retail banks. We cover areas such as:

- Individualization of product offers to existing clients
- Early fraud detection and fraud damage mitigation
- Prediction of product cancellations and client defections
- Optimal allocation of cash to ATMs and bank branches
- Minimization of the use of expensive bank channels such as branch visits
- Reliable assessment of clients for debt products

Common Datasets

Common datasets are used as a foundation for complex analysis.

Creation of Common Datasets for Analysis Related to the Bank's Clients

We create a dataset of monthly expense and income categories for all clients, covering all of their accounts and their complete history. The dataset is built from account movements, direct debits and standing orders. Each account movement usually carries a movement type code, such as electricity, phone bill or restaurant, and we also use the merchant name, description and comment fields to categorize each transaction. Direct debits and standing orders carry type codes as well.

We recognize several categories of expenses, such as housing (rent or mortgage), energy (gas and electricity), food and household, education (schools, books, courses), car expenses (fuel and repairs), restaurants, big-ticket items (TV, furniture), taxes, recreation and hobbies, credit card and loan payments, luxury items and so on. Income categories include salaries, dividends, tax refunds, social benefits, rental income, sales and so on.

Simple regression analysis of this dataset gives us overall trends for total expenses, income and savings, as well as detailed trends for each category of income and expense for each client.
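The categorization step can be sketched as follows. This is a minimal Python illustration; the type codes, keyword table and record fields are hypothetical, since real movement code tables are bank-specific.

```python
from collections import defaultdict

# Hypothetical mappings; real banks maintain much larger code tables.
TYPE_CODE_CATEGORY = {"ELEC": "energy", "PHONE": "household", "REST": "restaurants"}
KEYWORD_CATEGORY = {"rent": "housing", "mortgage": "housing", "fuel": "car"}

def categorize(movement):
    """Assign a category using the type code first, then the free-text fields."""
    cat = TYPE_CODE_CATEGORY.get(movement.get("type_code"))
    if cat:
        return cat
    text = " ".join(filter(None, (movement.get("merchant"),
                                  movement.get("description"),
                                  movement.get("comment")))).lower()
    for keyword, cat in KEYWORD_CATEGORY.items():
        if keyword in text:
            return cat
    return "uncategorized"

def monthly_totals(movements):
    """Aggregate amounts per (client, month, category)."""
    totals = defaultdict(float)
    for m in movements:
        month = m["date"][:7]                      # "YYYY-MM"
        totals[(m["client_id"], month, categorize(m))] += m["amount"]
    return dict(totals)

movements = [
    {"client_id": 1, "date": "2013-09-02", "amount": -40.0,
     "type_code": "ELEC", "merchant": "PowerCo", "description": "", "comment": ""},
    {"client_id": 1, "date": "2013-09-05", "amount": -600.0,
     "type_code": None, "merchant": "ACME Realty", "description": "monthly rent", "comment": ""},
]
print(monthly_totals(movements))
```

The resulting per-month totals per category are exactly the table that the regression analysis of trends runs over.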
Machine Learning and Predictions

We use a full range of machine learning algorithms and models to make predictions. There are two broad categories: supervised and unsupervised algorithms.

Supervised learning algorithms use historical data to learn that certain combinations and values of inputs cause certain outputs. We create models that are trained and verified on samples of historical data. Sample data can be chosen randomly, but we have seen better results when we categorize our datasets first. In the case of the customer dataset, we create categories such as age, income, location based on town size, education and savings. Each category is split into brackets; for example, the age category is split into 20 five-year age brackets. Because we know how many customers are in each age bracket, we can sample a certain percentage of records from each bracket, and we sample the other categories the same way. These samples are ideal for seeing which category makes the largest contribution to the overall result. For example, we can see that education makes the largest contribution to acceptance of a certain investment product.

Unsupervised machine learning algorithms look for unknown patterns in the available data. For example, we find patterns of unusual client behaviour as early signs of fraud. In the past we were limited to statistical analysis of behaviour that was common to all clients or to large groups of clients. With unsupervised learning models we can find patterns that surface in only a small number of records.

Individualization of Product Offers

Banks save money on expensive broad marketing campaigns for bank products. Products are offered only to customers who need them and are likely to accept them, so customers see fewer irrelevant offers. This requires deep knowledge of who accepted given products in the past.
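The stratified sampling described under Machine Learning and Predictions, with an equal fraction drawn from each five-year age bracket, can be sketched as follows. The bracket width and the minimum of one record per bracket are illustrative choices.

```python
import random

def stratified_sample(customers, fraction, seed=0):
    """Sample the same fraction from each five-year age bracket (20 brackets, ages 0-99)."""
    rng = random.Random(seed)
    brackets = {}
    for c in customers:
        brackets.setdefault(min(c["age"] // 5, 19), []).append(c)
    sample = []
    for bracket in sorted(brackets):
        members = brackets[bracket]
        k = max(1, round(len(members) * fraction))   # keep at least one record per bracket
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical customer records; the same approach applies to income,
# location, education and savings brackets.
customers = [{"id": i, "age": 18 + (i % 60)} for i in range(1000)]
picked = stratified_sample(customers, fraction=0.1)
print(len(picked))
```

Sampling each bracket at the same rate keeps the sample's age distribution aligned with the full dataset, which is what makes per-bracket contributions to the overall result comparable.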
As input for our models we use a dataset of subscriptions to bank products and services for each client, including previous subscriptions and cancellation dates. We also use the common dataset of income and expense categories for each client, and CRM data about clients. We have created separate models for each product and subscription. To prepare suitable models, we have to not only choose and verify the best learning algorithm but also find which categories and variables have the biggest influence.

Early Fraud Detection and Fraud Damage Mitigation

This includes detection of identity fraud, credit card fraud, wire fraud, attacks on internet and mobile banking, and money laundering. New types of fraud and new schemes require flexible and fast detection algorithms. In the past, banks used only statistical and rule-based algorithms to decide whether suspicious activity was taking place on a customer's account. These algorithms were limited: they can only recognize known frauds, they require expensive maintenance, they do not work with the full history of each client, and they produce a high level of false positives.

We utilized a dataset of known fraud cases and created several categories of these frauds, such as overdraft fraud with a stolen identity, stolen credit cards, consumer loan fraud, credit card top-up with a fraudulent check, stolen checks, skimming with card duplication, attacks on online banking with stolen customer credentials and/or security devices, rogue online merchant fraud using credit cards, and so on. We use neural networks with backpropagation, decision tree algorithms and classification to find patterns and unknown occurrences of these frauds in our existing data.

Prediction of Product Cancellations and Client Defections

A prediction of bank product cancellations and client defections is very time sensitive.
The bank has only days to act before a client irreversibly decides to cancel a product or move to the competition. The bank needs to identify clients who are likely to defect, contact them, and proactively offer alternative products or resolve the client's issues. It is much cheaper to retain highly profitable clients than to attract them back.

We have used account movements, debit and credit card movements, the client dataset from CRM, the product subscription dataset, call centre and branch visit transactions, and log information as the primary data sources for our analysis. We have also utilized the common datasets of income and expenses. We have prepared time series of key events such as direct debit cancellations; income to the account from salaries, dividends and rents; transfers to the client's accounts at different banks; call centre and branch contacts made by the client, separated into categories; cancellations of credit cards; and so on. We have prepared another set of clients who match categories such as age, income, savings and location for the same time interval but who remain clients, with matching time series for these clients as well.

Based on this data, we were able to create models that predict the behaviour of clients before they irreversibly decide to move to a competitor. We have used several supervised learning algorithms, such as Support Vector Machines for binary classification and neural networks with backpropagation for predictions. From the unsupervised machine learning algorithms we have utilized K-Means and Mean Shift clustering, after Principal Component Analysis was applied to reduce the dimensionality of the input data. We have identified several hundred profitable clients in recent data who match the patterns of clients who moved their accounts to competitors. These clients should be contacted by their respective bank branches.

Optimal Allocation of Cash for ATMs and Bank Branches

Demand for cash is highly variable during the year at many ATM and bank branch locations.
The variability is caused by weather, local events, vacations, tourism and so on. It is important to predict the right amount of cash to deposit into ATMs as well as bank branches. It is costly to service ATMs too often, and it is also costly to have cash machines out of order due to lack of cash. At the same time, we want to limit the amount of unnecessary cash stored for long periods in ATMs and bank branches: it leads to suboptimal cash allocation and attracts crime.

As the primary datasets we have used ATM service logs, geographic locations of ATMs and bank branches, the withdrawal dataset for each ATM, weather reports for ATM and bank branch locations, and schedules of sporting, cultural and other events, as well as holidays, for all locations. We have utilized credit and debit card movements to assess the demand for cash at various locations and during different times of the year. We have used the common datasets of incomes to see when salaries, social benefits and other incomes arrived in clients' accounts at different locations. We have created a dataset of median amounts of cash withdrawals for each day of the year and hour of the day for all ATMs. This dataset is used to calculate the influence of weather, events, day of the week and holidays on the demand for cash at a given location.
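The median-withdrawals dataset just described is a straightforward aggregation; a minimal sketch, assuming withdrawal records carry an ATM id, a day-of-year and an hour field (the field names are illustrative):

```python
from collections import defaultdict
from statistics import median

def median_withdrawals(withdrawals):
    """Median cash withdrawal per (atm_id, day_of_year, hour) bucket."""
    buckets = defaultdict(list)
    for w in withdrawals:
        buckets[(w["atm_id"], w["day_of_year"], w["hour"])].append(w["amount"])
    return {key: median(vals) for key, vals in buckets.items()}

withdrawals = [
    {"atm_id": "ATM-7", "day_of_year": 100, "hour": 12, "amount": 50},
    {"atm_id": "ATM-7", "day_of_year": 100, "hour": 12, "amount": 80},
    {"atm_id": "ATM-7", "day_of_year": 100, "hour": 12, "amount": 200},
]
print(median_withdrawals(withdrawals))  # median of 50, 80, 200 is 80
```

The median is preferred over the mean here because single unusually large withdrawals would otherwise distort the baseline that weather and event effects are measured against.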
We have prepared a dataset of significant cultural, sporting and other events during the past 4 years, with location coordinates. We have calculated the influence of each event on cash demand for all ATMs within a 300 m radius of the event, and we were able to sort all events by their influence on cash demand. This dataset is used to predict the influence of similar events. We have also calculated the correlation between local weather parameters, such as precipitation, temperature and wind at the location of each ATM, and cash demand. We have created a correlation dataset between the days when clients receive income, such as salaries and social benefits, and cash demand at different locations.

We have prepared models that can predict cash demand for each day of the year for each ATM and bank branch location. The model takes into account results from the historical datasets as well as weather forecast data and schedules of events. We have utilized algorithms such as Restricted Boltzmann Machines, Perceptrons and Gaussian Discriminant Analysis.

Minimize Use of Expensive Channels

We can minimize the use of expensive bank channels such as over-the-counter operations and other bank branch visits, as well as calls to call centres. This can be achieved by optimizing the online banking and mobile banking applications, help pages and wizards, and by optimizing the pages on the bank's websites. Another way to encourage reluctant clients to switch to cheaper channels is through targeted campaigns.

Our primary sources of data for this analysis were web log files from the online banking and mobile banking applications. We have also used bank account movements with bank channel codes, the dataset of call centre transactions, the CRM dataset with information about customers, and the dataset of transactions from bank branches. An important dataset was complaints and enquiries from the call centre, emails, letters and branches.
We have sorted these datasets by area of interest and correlated them with the help web pages. We were able to identify help pages that were unclear and caused confusion and unnecessary calls to the call centre. We have also identified several operations in online banking that were complex and generated a higher number of complaints. We have uncovered several areas, related to exchange rates during credit card payments, that were not covered by help pages but were often discussed over the phone or even during bank branch visits. Changes made to product-related web pages, self-help, search optimization, online banking operations and mobile banking applications can bring quick savings on outsourced call centres and bank branch visits.

We have analysed the results of marketing campaigns intended to move reluctant clients to online and mobile banking or self-service kiosks. Using correlation analysis, we have seen that broad marketing campaigns were not efficient. We have analysed the patterns of bank clients who recently moved most of their operations online. This gave us a tool to select the portion of clients who are more likely to move online. These customers should be targeted by personalized marketing campaigns or by demonstrations of the advantages at bank branches.

Assessment of Clients for Debt Products

To reliably assess risk and approve debt products for existing clients, we need to take into account not just current credit scores and the current disposable income of the client, but also the complete history of the client as well as their social context. This decreases risk for the bank and increases income from valuable clients who would otherwise be rejected. As a primary source of data we have used the common dataset of income and expenses, the complete history of payment morale for credit cards, consumer loans, mortgages, overdrafts and other debt products, and CRM information about clients.
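Payment-morale histories like these can be summarized as transitions between payment states. A minimal sketch of estimating a transition matrix from observed monthly states; the state names (paid, late, default) are illustrative:

```python
from collections import Counter, defaultdict

def transition_matrix(histories):
    """Estimate P(next_state | state) from observed monthly payment states."""
    counts = defaultdict(Counter)
    for history in histories:
        for a, b in zip(history, history[1:]):
            counts[a][b] += 1
    return {
        state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for state, nexts in counts.items()
    }

# Hypothetical monthly payment states per client: paid (P), late (L), default (D).
histories = [
    ["P", "P", "P", "L", "P"],
    ["P", "L", "L", "D"],
    ["P", "P", "L", "P"],
]
matrix = transition_matrix(histories)
print(matrix["L"])  # e.g. probabilities of recovering, staying late, or defaulting
```

A client's estimated probability of eventually reaching the default state, computed from such a matrix, is one way to complement a conventional credit score.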
We have used a Markov chain stochastic process to assess the debt- and payment-morale-related behaviour of clients. The model was tested on historical data of profitable and defaulted loans, credit cards and other debt products. We have noticed an improvement in the reliability of credit scores, and we were able to suggest suitable alternative debt products for rejected clients.

Overview of Primary Datasets and Sizing Example

These are examples of primary datasets and sizing calculations. Each project is specific and not all datasets are available, but the data sizing calculations are likely to be similar.

Account movements for all active and former clients. This dataset includes the complete history of account movements for all current and savings accounts. It contains 6 million unique clients and 23 million active and closed accounts. The average size of movements per account is 1 MB, which gives us 23 TB of uncompressed, denormalized CSV files.
The dataset of debit and credit card movements contains 25 million unique card IDs. We have on average 3 thousand transactions per card number, so the total number of records is 75 billion. Each record in uncompressed CSV form is 1 kB, so the total size of this dataset is 75 TB.

Technical log files from the internet and mobile banking applications amount to 50 TB. These files include front-end Apache log files as well as application logs.

Bank transactions, requests for help and complaints from the call centre. This dataset contains bank transactions, help requests and complaints from 1 million unique customers. The average number of interactions per customer is 35, and the typical size of an interaction is 10 kB, so the total size of the dataset is 350 GB.

CRM information about clients with historical values includes personal information about customers, such as employment, education, age and family status. The dataset includes current and historical information for about 6 million clients, with a typical size of 100 kB per client, for a total of 600 GB.

Direct debits and standing orders of bank clients with historical values. The typical number of standing orders and direct debits per client, including historical values, is 50, and the size of a single record is 1 kB. The total size of the dataset for 6 million clients at 50 records per client is 300 GB.

Product subscription data for all clients with complete history. A typical number of current and historical subscriptions per client is 12; this includes accounts, mortgages, loans, credit cards and other bank products. 6 million clients multiplied by an average of 12 subscriptions per client and by 1 kB per subscription is 72 GB.

Customer data from branch visits. This dataset includes over-the-counter bank transactions, help requests, product subscriptions and cancellations, and complaints. The typical number of interactions per client is 10.
We do see large differences in the utilization of branch services among clients. 3 million clients at 10 interactions of 10 kB each means 300 GB.

The dataset of debtors and the dataset of failed applications for debt products: the total size of 1 million records in these datasets is 1 GB.

Help file usage from mobile and internet banking: 6 million users multiplied by an average of 1,000 clicks on help files and by an average record size of 1 kB is 6 TB.

The total size of all primary datasets is 156 TB, calculated as a simple sum: 75 TB + 50 TB + 23 TB + 6 TB + 600 GB + 350 GB + 300 GB + 300 GB + 72 GB + 1 GB = 156 TB. We can reduce the overall size by using compression and by removing technical fields that carry no business meaning from the datasets. Log files are also reduced by removing lines with no business meaning.

Implementation Steps

Isolation of Sensitive Data from Big Data Analytics

To isolate Big Data analytics from sensitive data, we remove clients' names, addresses, telephone numbers and emails during the data export processes. The next step is to create a process that replaces real credit and debit card numbers, account numbers and customer IDs with randomly generated numbers. These randomly generated numbers must be identical for the same entity across different datasets to enable analytics. The process stores the pairs of matching real and randomly generated numbers in tables kept in a separate, secure relational database that is continuously updated. This database is also used to match the randomly generated numbers back to the real numbers after the Big Data analysis is performed. This isolates data scientists and administrators from sensitive information, which remains accessible only to authorized bank employees.
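The pseudonymization process can be sketched as below. This is a minimal in-memory illustration: in production, the mapping table lives in the separate secured relational database described above, and the class name and methods are hypothetical.

```python
import secrets

class Pseudonymizer:
    """Replace real identifiers with random tokens, consistently across datasets.

    The real-to-token pairs are kept in `mapping`; in production this table
    would live in a separate, secured relational database.
    """

    def __init__(self):
        self.mapping = {}
        self.reverse = {}

    def pseudonym(self, real_id):
        """Return the token for real_id, generating one on first sight."""
        if real_id not in self.mapping:
            token = secrets.token_hex(8)        # random, not derivable from real_id
            self.mapping[real_id] = token
            self.reverse[token] = real_id
        return self.mapping[real_id]

    def real(self, token):
        """Map back after the analysis; restricted to authorized bank employees."""
        return self.reverse[token]

p = Pseudonymizer()
t1 = p.pseudonym("4556-1234-5678-9000")   # card number seen in dataset A
t2 = p.pseudonym("4556-1234-5678-9000")   # same card seen in dataset B
print(t1 == t2)                            # identical token across datasets
```

Because the token is random rather than a hash of the real number, the mapping table is the only way back, which is exactly the isolation property the process relies on.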
Extraction, Transformation and Loading of Primary Datasets

We have an initial ETL (Extraction, Transformation and Loading) of data and continuous processes of daily or hourly updates that import recent data from the bank's production systems. The initial extraction was performed by the bank's production and backup systems. Data was extracted in denormalized text form, in CSV or fixed-length field formats. This form is ideal for bulk uploads into Big Data systems; the denormalized form uses concrete values instead of the reference IDs used in relational databases. Continuous data exports are channelled via JMS, MQ Series, CSV files and Sqoop. Exported data is picked up by Big Data scripts such as Pig or Hive, and these scripts are triggered via Oozie processes.

Transformation of Input Data

Transformation rules and scripts are shared by the initial and continuous ETL processes. We have used Pig and Hive scripts and UDFs (User Defined Functions) written in Java to perform the transformation steps. Oozie workflows were used to chain the transformation steps. We have used several practical rules for data transformations:

- Each file format is kept in its own directory inside HDFS (the Hadoop file system).
- Unprocessed and failed records are written into specific directories for manual investigation.
- Intermediate result files are deleted only after all transformation steps have completed successfully. This saves HDFS space while still allowing us to investigate and re-run incomplete transformations.
- Pig and Hive scripts are kept simple and single purpose. This enables easy debugging and re-use.
- Java UDFs are used only if a given function is not available in the standard library or in the PiggyBank library.
- Transformation scripts are reused for processing updates.
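The rule of routing unprocessed and failed records into separate directories can be sketched as follows. In practice this logic lives in Pig/Hive scripts and Java UDFs over HDFS; this standalone Python sketch, with a hypothetical validation rule and file layout, only illustrates the routing pattern.

```python
import csv
from pathlib import Path

def transform_file(src, out_dir, rejects_dir):
    """Apply one transformation step, routing bad records for manual review."""
    out_dir, rejects_dir = Path(out_dir), Path(rejects_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rejects_dir.mkdir(parents=True, exist_ok=True)
    ok, failed = [], []
    with open(src, newline="") as f:
        for row in csv.reader(f):
            try:
                # Illustrative rule: the amount field must parse as a number.
                row[2] = "%.2f" % float(row[2])
                ok.append(row)
            except (ValueError, IndexError):
                failed.append(row)
    name = Path(src).name
    with open(out_dir / name, "w", newline="") as f:
        csv.writer(f).writerows(ok)
    with open(rejects_dir / name, "w", newline="") as f:
        csv.writer(f).writerows(failed)
    return len(ok), len(failed)
```

Keeping the rejects alongside the output, under the same file name, makes manual investigation and re-runs of incomplete transformations straightforward.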