Making Bank Predictive and Real-Time


Published in: Technology
  • My name is Anurag Shrivastava. I lead an engineering team that builds data platform and customer intelligence solutions. I work for ING Bank, a very large retail bank in the Netherlands. (Stand still and talk slowly.)
  • Every day we get tens of emails containing one promotion or another, and we throw most of them away. However, if you get an email with a special cake price two days before your wife’s birthday, you are likely to be happy. Most marketing goes to waste when the sender does not know what you want; the right offer at the wrong moment rarely works. Suppose somebody tries to use your credit card in Australia while you are watching my presentation: how soon will your credit card company know that and inform you?
  • Data silos have grown over time out of the specific needs of each value chain. This causes data duplication and many point-to-point interfaces. Batch processing means processing files at intervals; because of the multiple data silos, we have to process the same file several times, leading to complex ETL routines and a short window for fault recovery. At this moment, analytics is limited to structured data for marketing purposes: data is analyzed at rest, and a model is built and deployed for the campaigns. We process around 1,000 batch files daily.
  • Netezza seemed to be a great idea because it offered consolidation and lower maintenance overhead. It was also much faster than our Oracle-based DWH. However, this was before Hadoop shot into prominence: the decision was taken in 2011, when we were not familiar with Hadoop. The cost of Netezza is high, and we also had to redo our ETL.
  • Explain how this stack has been built starting from bottom to top.
  • First-mover disadvantage, in contrast with Oracle, which is well known. BI departments are SQL-focused, and stacks do not change much over long periods of time. Big Data companies are very small: when our CIO/CFO visited these companies in Silicon Valley, they were surprised by their small size.
  • Data centres have been designed for high-end servers, while Hadoop works on cheap servers. There is the risk of fire, and the risk of shutting down the entire network. Virtualization is used heavily, so the concept of data locality is foreign here. Large IT organizations share a lot: monitoring, build and backup are shared, so a new system has to be compatible with them or it becomes special and the cost of service goes up. An IT infra engineer sees Hadoop as an elephant in a zoo full of tigers.
  • Start small, learn and move on.
  • Once you combine data from many sources, its sensitivity increases. Fine-grained security on Hadoop is still not ready.
  • We settled for hardware and software recommended by our data centre, though we could have saved more (and spent more time) with cheaper hardware. We used HP SL4540: one node costs EUR 16,000 for a 16-core CPU and approximately 30 TB of storage per node. You do not want to install 18 nodes manually.
  • Every new tool means following a cumbersome change process. We chose Hortonworks because of their clean open-source approach. Innovation can be tempting, but you cannot implement every new tool.
  • Explain the purpose of each block.
  • So we can build predictive models, but how to deploy them has yet to be figured out. Real-time: yet to be done.
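The credit-card-in-Australia question raised earlier reduces to a geo-velocity rule: two uses of the same card, too far apart geographically, too close together in time. A toy sketch of such a rule, with illustrative distances and a hypothetical speed threshold (a real system would use geocoded transaction data):

```python
# Straight-line distances in km between city pairs (illustrative values).
DISTANCES_KM = {("Amsterdam", "Sydney"): 16650, ("Amsterdam", "Brussels"): 175}

MAX_SPEED_KMH = 900  # roughly the cruising speed of a passenger jet

def impossible_travel(city_a, city_b, hours_apart):
    """Flag two card uses whose implied travel speed no flight could achieve."""
    if city_a == city_b:
        return False
    km = DISTANCES_KM.get((city_a, city_b)) or DISTANCES_KM.get((city_b, city_a))
    if km is None:
        return False  # unknown pair: this toy version cannot judge it
    return km / max(hours_apart, 1e-9) > MAX_SPEED_KMH
```

The value of doing this in near real-time, rather than in the nightly batch, is that the bank can block the card or alert the customer while the fraud is still in progress.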

    1. Making Bank Predictive and Real-Time. Anurag Shrivastava, ING Bank
    3. What is predictive & real-time?
        Predictive: the ability to predict customer behavior based upon data and to generate appropriate actions
        Real-time: the ability to take appropriate actions in near real-time based upon events generated during customer interactions
    4. Data Platform: Challenges
        Data silos: 160 Oracle instances, 70 TB, 30 data warehouses
        Batch processing: 120,000 ETL jobs, OWB, ODI
        Limited analytics: segmentation for campaigns using SAS
    5. Data Platform: Initial Thoughts (Netezza)
        Pros: less effort required in database maintenance; 20 to 60 times query performance improvement; hardware maintenance handled by the vendor; powerful data transformation and analytics capabilities with the help of accelerators; fits well in a large data centre
        Cons: expensive data storage at EUR 15K to 70K per terabyte; proprietary technology and hardware; explosive growth in data storage requirements due to logs from online channels means more investment is needed in Netezza
        All trademarks and copyrights are acknowledged by the author.
    6. Data Platform Target Architecture (components)
        Enterprise systems and external systems
        Enterprise Data Overlay (DataStage, …)
        Enterprise Data Warehouse (Hadoop) and Predictive Analytics Lab (Hadoop)
        Data marts (Netezza)
        BI tools and applications
    7. Challenges with Hadoop in a Large Bank
        New technology: first-mover disadvantage
        Experienced people are hard to find
        Attention and hype from CXOs (read: pressure to deliver)
        Do we really have a big data problem?
        No clear leader in the vendor space
        Open-source and Java-focused community
    8. IT Challenges in a Large Bank
        High-end servers
        Virtualization
        Storage area networks
        Shared services: build server, monitoring, back-up, etc.
    9. Our Journey: Play Area Big Data, Predictive Analytics Lab, Production System
    10. Play Area Big Data
        Goals: quickly learn about Hadoop capabilities; create interest and awareness in the organization
        What we did: set up a small Hadoop cluster with old unused HP blades in a test area; got started quickly with a distribution recommended by a consulting company; one-time load of old data, no ETL, no scheduling, etc.; small team experimenting with data using R
    11. Predictive Analytics Lab
        Goals: capability to build predictive models for business cases; a secure environment designed for the data scientists who build them
        What we did: Hadoop cluster with brand-new hardware; managed to install it in a data centre; secure and monitored; based upon HDP 2.0
    12. Implementation Challenges (1/3): Securing Hadoop
        Strong perimeter security limiting access to a small set of users
        Multi-factor authentication
        Stepping stone to the Hadoop cluster using Citrix
        Enterprise repository for deployment with jars pre-screened for Trojans and malware
        Monitoring for various events
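The pre-screened repository idea above can be sketched as a checksum allowlist: an artifact may only be deployed to the cluster if its digest matches an approved manifest. The jar name and digest below are hypothetical, not from the actual repository:

```python
import hashlib

# Hypothetical manifest of pre-screened artifacts: name -> approved SHA-256.
# (This digest is simply the SHA-256 of the bytes b"foo\n", for illustration.)
APPROVED = {
    "hive-exec-0.13.0.jar":
        "b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c",
}

def digest(data: bytes) -> str:
    """SHA-256 hex digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()

def is_deployable(name: str, data: bytes) -> bool:
    """Allow deployment only if the artifact matches the screened manifest."""
    return APPROVED.get(name) == digest(data)
```

Any tampered or unknown jar fails the check, so malware cannot ride into the cluster through the deployment pipeline.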
    13. Implementation Challenges (2/3)
        Hardware: data centre not ready for cheap/commodity hardware; automated deployment only possible on VMs, so no install possible on bare metal; a compromise between costs and DC standards
        Automated provisioning of Hadoop: Ansible for automated provisioning of 18 nodes; Ambari for monitoring the cluster; it is easy to automate the provisioning and highly recommended
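Provisioning 18 nodes by hand is exactly what the Ansible approach above avoids; an automated run starts from a generated inventory. A minimal sketch (the hostnames, domain and group name are invented for illustration):

```python
def hadoop_inventory(count: int = 18, domain: str = "hadoop.example.bank") -> str:
    """Generate an Ansible-style INI inventory for a fixed-size Hadoop cluster.

    The naming scheme is illustrative; a real inventory would also carry
    per-host variables (disk layout, roles such as NameNode vs DataNode).
    """
    lines = ["[hadoop_nodes]"]
    lines += [f"node{i:02d}.{domain}" for i in range(1, count + 1)]
    return "\n".join(lines)

print(hadoop_inventory())
```

Generating the inventory instead of typing it keeps the node list consistent with the cluster size in one place, which matters once nodes are added or replaced.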
    14. Implementation Challenges (3/3): Rapid Pace of Innovation
        The Hadoop community is very active on the innovation front
        As we built our Hadoop cluster, new names such as Spark, Accumulo and Falcon popped up
        Infra processes are waterfall-based, forcing a pause every time a new tool pops up
        Many distributions to choose from
    15. Predictive Analytics Lab (environment diagram)
        Stepping stone (Citrix); 18 Hadoop nodes; Git, libraries, build tools; monitoring services; data files in batches; dedicated VLAN; shared services; SMTP relay; internet via corporate infrastructure
        Firewall rules guard the perimeter security of the Hadoop cluster
    16. Predictive Analytics Lab
        Team and process: Scrum with 3-week sprints; data scientists and Hadoop engineers in the team; every sprint demonstrates working software to the stakeholders
        Lab environment: Hortonworks HDP 2.0; Hive; Ambari and Ansible; RStudio; Hue, HCatalog
    17. Our Journey: Play Area Big Data, Predictive Analytics Lab, Production System
    18. Production System
        Goals: meet the diverse information needs of the business; deploy predictive models to production; cut data storage costs without compromising reliability and availability
        What we need: fine-grained security; ETL and workflow tools; automated deployment of predictive models; disaster recovery
    19. Real-time
        Hadoop is a batch processing system, not designed for real-time analytics
        A predictive model that has to perform in near real-time requires a deployment platform different from Hadoop
        Real-time here means near real-time, or micro-batches
        Candidate tools for evaluation: Storm, Spark and InfoSphere Streams
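The micro-batch idea named above can be sketched without any streaming framework: buffer the incoming event stream and process it in small fixed-size batches, which is conceptually what Spark's streaming mode does with time windows. The event shape, batch size and threshold below are all illustrative:

```python
from itertools import islice
from typing import Iterable, Iterator, List

def micro_batches(events: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group a stream of events into fixed-size micro-batches."""
    it = iter(events)
    while batch := list(islice(it, size)):
        yield batch

def suspicious_totals(events, size=3, limit=1000):
    """Per micro-batch, flag accounts whose summed amount exceeds a limit."""
    for batch in micro_batches(events, size):
        totals = {}
        for e in batch:
            totals[e["account"]] = totals.get(e["account"], 0) + e["amount"]
        yield [acct for acct, total in totals.items() if total > limit]
```

Real deployments batch by time window rather than event count, but the trade-off is the same: smaller batches mean lower latency at the cost of more scheduling overhead.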
    20. Business Cases
        Improve the segmentation for marketing
        Personal spending forecast
        Predict mortgage defaulters
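As a sketch of the "predict mortgage defaulters" case: a logistic model turns a weighted sum of customer features into a default probability. The features and coefficients below are invented for illustration; a real model would be fit on historical repayment data:

```python
import math

# Illustrative coefficients, NOT from any real model.
WEIGHTS = {"intercept": -4.0, "missed_payments": 1.2, "loan_to_value": 2.5}

def default_probability(missed_payments: int, loan_to_value: float) -> float:
    """Logistic regression score: sigmoid of the weighted feature sum."""
    z = (WEIGHTS["intercept"]
         + WEIGHTS["missed_payments"] * missed_payments
         + WEIGHTS["loan_to_value"] * loan_to_value)
    return 1.0 / (1.0 + math.exp(-z))
```

The deployment question raised earlier in the deck is exactly about this function: training it can happen at rest on Hadoop, but scoring it against live customer events needs a separate serving platform.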
    21. Hadoop Benefits
        Data hub or enterprise memory: schema on read; cheap but reliable storage; fault tolerant; lower cost of hardware and licenses
        Data-driven applications: run complex queries and predictive analytics models; build predictive models; increase revenue and lower risk
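"Schema on read" above means raw files land in Hadoop untouched, and column names and types are applied only at query time; different consumers can read the same file with different schemas. A minimal sketch with a made-up delimited log format:

```python
import csv
import io

def read_with_schema(raw, schema):
    """Apply column names and type casts to raw delimited text at read time.

    `schema` maps column name -> cast function, in column order. The raw
    data itself is never rewritten, which is the point of schema on read.
    """
    reader = csv.reader(io.StringIO(raw), delimiter="|")
    return [
        {name: cast(value) for (name, cast), value in zip(schema.items(), row)}
        for row in reader
    ]

# Hypothetical transaction log, stored exactly as it arrived.
raw = "2014-06-01|payment|250.00\n2014-06-02|withdrawal|40.00\n"
schema = {"date": str, "type": str, "amount": float}
```

Contrast this with the warehouse approach, where the schema is enforced at load time and a schema change means reloading or migrating the data.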
    22. Lessons Learned
        Hadoop is ready for early adopters; it can save you costs and accelerate predictive analytics
        Hadoop is not the complete solution for building a real-time and predictive platform
        Business-case-driven experimentation has a greater chance of acceptance than pure technical exploration in a large enterprise
        External expertise and a close link with the community are valuable
    23. Contact Details: Twitter @shri2201. Questions?