
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT


Standard Bank is a leading South African bank with a vision to be the leading financial services organization in and for Africa. We will share our vision, greatest challenges, and most valuable lessons learned on our journey towards enterprise adoption of a big data strategy.

This includes our implementation of: a multi-tenant enterprise data lake, a real time streaming capability, appropriate data management and governance principles, a data science workbench, and a process for model productionisation to support data science teams across the Group and across Africa and Europe.

Speakers
Zakeera Mahomed, Standard Bank, Big Data Practice Lead
Kristel Sampson, Standard Bank, Platform Lead



  1. Driving Enterprise Adoption: Tragedies, Triumphs and our NEXT (Standard Bank South Africa)
  2. Big Data Journey: Lake 2 (2016–2018)
     • Set up a POC environment for data science exploration
     • Begin the Data Lake journey
     • Get critical mass (enterprise data) ingested into the Lake
     • Create a multi-tenant Lake environment
     • Start onboarding tenants (data science teams) onto the Lake; each team gets a 'sandbox' environment to start prototyping models
     • Set up the Data Science Workbench on the Lake
     • Finalise the model productionisation process
     • Implement a real-time streaming capability
     • Productionise the first data science (real-time machine learning) model on the Lake
     • Establish the Data Science Guild
     • Establish security policies for data access
     • Integrate Lake metadata into the enterprise metadata repository
     • Move to use-case-driven prioritisation for DS model productionisation
  3. Successes
     • Security
     • Getting critical mass of enterprise data into the Lake
     • Establishment of our Data Science Workbench
     • Defining the model productionisation standards
  4. Challenges
     • Security
     • Integrating with other systems (Kerberos)
     • Open source connectors (Kerberos)
     • Proprietary connectors (SAS, SAP, Actimize, IBM)
     • Skills gap
     • Data Lake vs Hadoop, and strategy
     • From no demand to oversubscription
  5. Big Data Landscape
  6. 5 Pillars of Enterprise Security on Hadoop

     Pillar          | Intent                  | Tool
     Administration  | How do I set policy?    | Ambari / Ranger
     Authentication  | Who am I?               | Kerberos / LDAP
     Authorization   | What can I do?          | Ranger / Centrify
     Audit           | What did I do?          | Ranger / Centrify
     Data Protection | How can I encrypt data? | Ranger KMS / SSL
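To make the Authentication and Authorization pillars concrete, here is a minimal, hedged sketch of a client reaching WebHDFS with a Kerberos ticket. The endpoint, path and the `hdfs` / `requests-kerberos` packages are assumptions for illustration, not the bank's actual tooling.

```python
# Hedged sketch: Kerberos-authenticated access to WebHDFS (Authentication pillar),
# with Ranger still deciding what the authenticated principal may read (Authorization).
# Requires a valid ticket from `kinit` plus the `hdfs` and `requests-kerberos` packages;
# the endpoint and path below are placeholders, not real Standard Bank systems.
from hdfs.ext.kerberos import KerberosClient

client = KerberosClient("https://namenode.example.internal:9871")  # assumed WebHDFS URL

# The listing only succeeds if the Kerberos principal is authorised by Ranger policy;
# the attempt itself is what lands in the audit trail (Audit pillar).
for entry in client.list("/data/lake/landing"):
    print(entry)
```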
  7. Metadata Strategy
  8. Building a Lake and not a Swamp
  9. Setting up multiple tenants
     • Enterprise Lake with CIA Data Services and a proprietary Data Science Workbench
     • Edge nodes (Edge Node 1, Edge Node 2) running load-balanced virtual machines on KVM (active/passive)
     • Per-tenant application development, test and production workbench environments
     • Managed queues and managed storage per tenant
     • Common OS apps from an approved list of commonly used open source applications (e.g. R), served from a repo in the DMZ
  10. Self service on the edge node: Data Science Workbench
     Timeline markers: 2017 Q1, 2018, Feb 2018
     • Data science team set up with its own edge node (virtual machines against the Enterprise Lake data services, with managed queues and managed storage)
     • Ability to source data onto the edge node: up to 10 TB, with a copy into their own dev folder (1 TB) on HDFS to run distributed workloads
     • R Studio Enterprise Server + Jupyter Notebooks + SparkR: sufficient tooling available to build models
     • Ability to install applications on the edge node
     • Ability to build streaming data pipelines
     • Enable Power BI?
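As one hedged sketch of this self-service pattern, the snippet below shows a data scientist on the edge node copying a slice of curated lake data into their own dev folder on HDFS via PySpark, on a managed YARN queue. The queue name, paths and columns are assumptions for illustration only.

```python
# Hedged sketch: pull a slice of lake data into a personal 1 TB dev folder on HDFS,
# running on a managed tenant queue. Paths, queue and dataset are illustrative only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sandbox-extract")
    .config("spark.yarn.queue", "ds_sandbox")   # managed tenant queue (assumed name)
    .getOrCreate()
)

# Read from a curated lake dataset (placeholder path) ...
transactions = spark.read.parquet("/data/lake/curated/transactions")

# ... keep only what the prototype needs, and land it in the personal dev area.
sample = (
    transactions
    .filter(transactions.event_date >= "2018-01-01")
    .select("account_id", "amount", "event_date")
)
sample.write.mode("overwrite").parquet("/user/jdoe/dev/transactions_sample")
```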
  11. Private Cloud, Africa Regions. Step 1: near real time and batch (architecture)
     • Ingestion into the South Africa Data Lake: internal CDC via HDF (Kafka/NiFi/Storm), external streaming, and internal/external file-based EL (Ab Initio/Spark)
     • Data persisted to HDFS/HBase/Elastic etc., feeding feature extraction and batch model training (Spark model)
     • Models exposed (PMML) through continuous integration; model results and the model itself made available via API, with models deployed and replaced
     • In country: regional systems feed a regional reservoir and data warehouse, with their own feature extraction, batch model training (Spark model), exposed models (PMML) and continuous integration
  12. Private Cloud, Africa Regions. Step 1: near real time and batch (continued)
     • Model training happens off the SA Lake
     • Africa regional data is ingested into the SA Lake for data science
     • Results can be made available to Africa region systems via API
     • Can accommodate batch and near real-time data science models
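The slides describe the near-real-time path as HDF (Kafka/NiFi/Storm) feeding batch-trained models. As one hedged illustration of that scoring idea, the sketch below uses Spark Structured Streaming to read events from Kafka and score them with a batch-trained Spark ML pipeline; broker, topic, schema and model path are assumptions, not the production pipeline.

```python
# Hedged sketch of near-real-time scoring: events land on Kafka, are parsed, and a
# batch-trained Spark ML pipeline scores them continuously. Broker, topic, schema and
# model path are assumed. Requires the spark-sql-kafka connector on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("nrt-scoring").getOrCreate()

event_schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("channel", StringType()),
])

# Model trained in batch on the Lake, loaded for streaming inference (assumed path).
model = PipelineModel.load("/models/fraud/pipeline_v1")

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # assumed broker
    .option("subscribe", "card-transactions")            # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Score each micro-batch; results could be written back to Kafka or an API-facing store.
scores = model.transform(events).select("account_id", "prediction")
query = scores.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```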
  13. Data Science Workbench and Tools: spanning IT Engineering and Data Science, with not many standards
  14. Roles and Responsibilities: Data Lake interactions
     • Data Scientist (DSC): business requirement and idea; data source requirements; model design (predictive model); model testing, processing and visualisation; model optimisation; at launch, trigger model productionisation; monitor the model in production
     • Data Engineer (DEV): data source requirements; ETL development and productionisation; at launch, serialise the model and move it to the production platform
     • Platform/Infra Engineer (OPS): set up the project for data from the business requirements; for existing data, subscribe to it; for new data, build a new ingestion pipeline with deployment and subscription; monitor resources, access to tools and quota; at launch, deploy the new project (existing data or new data source); in production, monitor job performance and execution/queue capacity
  15. Data Science to BDE (Big Data Engineer) ratio
     Diagram: the ratio of data scientists to big data engineers shifts with the number of sources, variety of data types, number and complexity of use cases, % of workflow and automation, data science technical skill, backlog of productionisation work, velocity, and streaming requirements.
  16. Data Science Guild

     Driving principles: We are a community of like-minded professionals and enthusiasts who share the common goal of teaching, learning and shaping the future of Data Science within Standard Bank Group. Our focus is on building a local community of practitioners that can effectively share knowledge and best practices, and provide opportunities for collaboration across business units and functional areas. We seek to consolidate needs and preferences for Data Science technologies across individuals and teams, bringing a unified vision for Data Science to our Big Data environments. We share our thoughts and ideas, and we work with openness, with the understanding that advancement depends on collaboration and shared learning.

     Mission statement: The Data Science Guild is a technical and data-savvy group of practitioners discussing the application of artificial intelligence and machine learning across the Standard Bank Group. We aim to guide, assist, and improve the development and productionisation of machine learning algorithms and statistical models on our Data Lake and Data Reservoir environments.

     Objectives:
     • Advance Data Science principles across lines of business, using a common practice definition
     • Provide guidance and direction to practitioners
     • Establish policies, standards and processes for the application of Data Science use cases on shared production environments
     • Socialise Data Science use cases, success stories and stumbling blocks for shared knowledge
     • Provide professional education standards and training pathways

     Terms of reference: The Data Science Guild is mandated by the Enterprise Data Committee and forms part of the Data Community of Practice. The Guild has the responsibility of representing the Data Science professionals in the Group and ensuring that they are equipped with the education, tools and means by which Data Science assets can be defined, controlled, used and communicated for the benefit of the Group and its component business entities.

     Membership proposition: Join the Data Science Guild to advance your individual capabilities and to get exposure to an array of use cases and methodologies. Join one of the working groups to directly contribute to shaping the practice of Data Science within Standard Bank Group. Get exposure to the developing curriculum of Data Science training opportunities targeted to your individual level of practice and expected business application. We run monthly meetings that include: an overview from working groups (education, tooling, and productionisation standards); demonstrations of use cases from teams across business units; and connect sessions for practitioners to network across teams.

     Functional scope (service offering: supported capabilities and owned toolsets):
     • Business Data Science: Executive Education (presentation materials and collateral)
     • Technical Data Science: Data Science Knowledge Sharing (code repository)
     • Operational Data Science: Data Science Tools (Data Science Workbench); Data Science Model Productionisation (defined productionisation standards)
     • Education: Grad Training Programme

     2017 office holders: Michelle Gervais (Chair); Kristel Sampson (Deputy Chair); Membership (TBD); General Professional Development / Events (TBD); Working Group Lead: Productionisation (TBD); Working Group Lead: Training Programme (TBD)
  17. Model productionisation: on and off the Lake
     • Flow on the Lake: the data scientist does model dev, model test/train and model serialisation; the data engineer productionises the data and runs Prod Model Predict and Prod Model Train, monitoring both; if Model Predict decays, it is replaced with the retrained model
     • Option 1: deploy both predict and train on the Data Lake
     • Option 2 (preferred): deploy Model Predict off the Lake on a dedicated deploy server, for scalability, volume of data, number of users and low latency
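As an illustration of Option 2, here is a minimal sketch of a predict service running off the Lake: a serialised model loaded behind a small HTTP API. Flask, joblib and the payload shape are assumptions, not the team's actual deployment stack; a PMML scoring engine could sit in the same place.

```python
# Hedged sketch of "Option 2": the predict step runs off the Lake as a small HTTP
# service wrapping a serialised model artefact. Names and formats are illustrative.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model_v1.joblib")   # artefact produced by the serialisation step

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # Expecting e.g. {"features": [[0.3, 120.0, 1.0]]}; the shape depends on the model.
    prediction = model.predict(payload["features"]).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    # Monitoring of these predictions is what detects "Model Predict" decay and triggers
    # the replace-with-retrained-model step described on the slide.
    app.run(host="0.0.0.0", port=8080)
```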
  18. Model Serialisation
     Options under consideration: Spark serialisation, deep learning model serialisation, Python serialisation, JSON.
     Serialisation of different types of models needs to be investigated. One of the main goals of model serialisation is the ability to embed a model into a production system, so some of the available options (for example Python pickling) may not be viable. Having one standard such as JSON may not be possible either, as it may not cater for complex models. Model serialisation requires unpacking and some prototyping.
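For comparison, here is a minimal sketch of two of these serialisation routes, a Python-native joblib pickle and Spark ML's own save format; the model, paths and the note on PMML converters are illustrative assumptions rather than the team's chosen standard.

```python
# Hedged sketch of two serialisation routes. Models, paths and version tags are assumed.
import joblib
from sklearn.linear_model import LogisticRegression

# Python-native (pickle via joblib): simple, but ties production scoring to a compatible
# Python runtime, which is why it may not be viable for embedding in other systems.
sk_model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
joblib.dump(sk_model, "sk_model_v1.joblib")

# Spark ML: pipelines save as metadata plus parquet and reload with PipelineModel.load,
# keeping scoring inside a Spark runtime. (Assumes an existing SparkSession and a fitted
# `spark_pipeline_model`, so it is left commented out.)
# spark_pipeline_model.write().overwrite().save("/models/churn/pipeline_v1")

# PMML/JSON exchange formats would need a dedicated converter (e.g. the jpmml tooling)
# and, as the slide notes, may not cover every complex model type.
```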
  19. What are the AI use cases?
     • One size fits all: The Data Services team is building a suite of anomaly detection models to solve for domains such as software testing, customer behaviour monitoring, trading patterns, price formation, system performance (servers, networks, OS, software), fraud detection and customer support.
     • Who are you? The Security team has sponsored a facial recognition engine to enhance security for digital channels. Leading research has excluded an adequate corpus of African faces, hence the need for a custom solution.
     • The price is right! Markets have become more interconnected and data-driven. Using AI, the Global Markets team is becoming more efficient and competitive in our market-making, risk management, pricing and execution. We use our data to understand the actions of market participants and to change the way we react to our trading environments.
  20. What are the AI use cases? (continued)
     • Shap! Eish! Hujambo! Modern research on sentiment analysis and natural language processing has focused on the English language. Our approach is to build models based on the vernacular languages of the regions within which we operate.
     • Work smarter, not harder: The Intelligence Automation team has been automating a number of business processes, including account origination. The next generation of business processes to be automated will include embedded artificial cognition.
     • Show your money who's boss: Standard Bank is making your financial management personal. Using the latest machine learning techniques, we have developed a prototype, commissioned by the PBB SA Digital team, which produces an accurate and customised forecast of a customer's upcoming transactions.
     • We are all connected: Using the power of distributed computing, we are building graph database capabilities to connect our customer records across independent databases and systems. The Nigeria Ecosystem model, based on graph, is helping to generate leads for CIB.
  21. 21. What’s our next? 21
  22. Our Next: Model Productionisation
     • Python/Spark (machine learning, analytical): Data Science Workbench set up with an Anaconda repo; set up with SparkR, but an update to Python is required and a request is in for sparklyr and other R packages
     • R (statistical modelling)
     • TensorFlow on CPU (deep learning): set up a cluster linked to HDFS, running on CPU rather than GPU
     • Graph (entity linking): tactical Neo4j installation per user on the edge node
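As a hedged sketch of the tactical graph/entity-linking item, the snippet below links customer records in Neo4j via the official Python driver; the URI, credentials, node labels and matching rule are all assumed for illustration and are not the Nigeria Ecosystem model itself.

```python
# Hedged sketch of entity linking in Neo4j: customer records from different source
# systems are linked when they share an identifier. Connection details, labels and the
# matching rule are illustrative assumptions only.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LINK_BY_SHARED_PHONE = """
MATCH (a:CustomerRecord), (b:CustomerRecord)
WHERE a.phone = b.phone
  AND a.source_system <> b.source_system
  AND id(a) < id(b)                      // avoid linking each pair twice
MERGE (a)-[:SAME_ENTITY {rule: 'shared_phone'}]->(b)
"""

with driver.session() as session:
    session.run(LINK_BY_SHARED_PHONE)
driver.close()
```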
  23. Our Next
     • Cloud
     • Microservices architecture
     • Self service (business units enabled)
     • Data tokenisation (a sketch follows below)
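A minimal sketch of what data tokenisation could look like in practice, assuming a simple keyed-hash (HMAC) scheme; the key handling and field choice are illustrative only, not a statement of the bank's approach.

```python
# Hedged sketch of data tokenisation: replace a sensitive value with a deterministic,
# non-reversible token so datasets can still be joined on it without exposing the raw
# value. The in-code key is illustrative only; a real setup would use a vault/HSM.
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-managed-key"   # assumption: fetched from a key manager

def tokenise(value: str) -> str:
    """Return a stable token for a sensitive value such as an account number."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Same input always yields the same token, so joins and aggregations still work.
print(tokenise("4071234567890123"))
```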
  24. Thanks! Any questions? You can find us at:
     • Kristel.Sampson@standardbank.co.za
     • Zakeera.Mahomed@standardbank.co.za
