
Machine Learning & Artificial Intelligence - Machine Controlled Data Dispensation



By Muthuvenkatesh Sivakadatcham (Principal Test Consultant) and Karthikeyan Mani (Technology Test Lead), Infosys. Presented at STeP-IN SUMMIT 2018, the 15th International Conference on Software Testing, on August 30, 2018 at Taj, MG Road, Bengaluru.

Published in: Technology


  1. Machine Learning & Artificial Intelligence - Machine Controlled Data Dispensation
     Muthu Venkatesh - Principal Test Consultant
     Karthikeyan Mani - Technology Test Lead
     © 2016 Infosys Ltd.
  2. Abstract
     • Handling sensitive customer data has always been a challenge for organizations in the banking, insurance and healthcare domains.
     • Building and validating data analytics models has become hugely important for staying afloat amid stiff competition from peers, but it is equally important to adhere to data dispensation rules.
     • There are well-established traditional methods for identifying sensitive information so that it can be masked, preventing intruders or any third party involved in data validation from interpreting it with malicious intent.
     • The pain point is that identifying sensitive information, whether manually or through coded automation, demands a lot of effort and continuous updating of scripts.
     • This paper shows how a machine can be deployed for the same task with minimal intervention from the user.
     • Testers, test leads, test managers, business analysts and data scientists will benefit from this idea.
  3. Table of Contents
     • Introduction / Background
     • In-Scope / Out of Scope
     • Step 1 – Training Set Creation
     • Step 2 – Test Set Creation
     • Step 3 – Fine Tuning Algorithm
     • Step 4 – Validation
     • Demo
  4. Introduction
     • Increased focus on data security across various domains
     • Identifying sensitive data across systems is the biggest challenge for many organisations today
     Machine Learning – Easy 4 Step Process:
     • 1: Training Set Creation
     • 2: Test Set Creation
     • 3: Algorithm Tuning
     • 4: Validation
  5. In-Scope
     • Identifying sensitive data stored in database schemas
     • Identification based on selected attributes of each column in the schema
     Out of Scope
     • The current solution does not handle other aspects of data regulation compliance, such as GDPR.
     • Columns with free text (blogs, chat histories, etc.) are not analyzed in the current solution. They are treated as a distinct value at a high level.
     • Data from sources other than datastores is not handled as part of this exercise.
  6. Step 1 – Training Set Creation
     • A parser program runs on the schema tables, creates a training set with non-null values, and stores it in the training set file.
     • The fields/columns needed for sensitive data determination are pulled out: column name, max column length and non-null value.
     Flow: Database → Parser program → Training set with non-null values → File 1 (Training_Set)
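The parser step can be sketched as follows. This is a minimal illustration in Python against SQLite, not the program used in the talk (the case study mentions Java). The function names `quote_ident`, `profile_table` and `write_set`, the `samples_per_column` parameter and the CSV layout are all assumptions made for the example; only the three extracted fields (column name, max column length, non-null value) come from the slide.

```python
import csv
import sqlite3


def quote_ident(name):
    # Quote an SQL identifier so unusual table/column names are handled safely.
    return '"' + name.replace('"', '""') + '"'


def profile_table(conn, table, samples_per_column=3):
    """Return (column_name, max_column_length, value) rows for one table."""
    q_tab = quote_ident(table)
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [r[1] for r in conn.execute(f"PRAGMA table_info({q_tab})")]
    rows = []
    for col in columns:
        q_col = quote_ident(col)
        # Longest stored value, measured on the text form of the column.
        max_len = conn.execute(
            f"SELECT MAX(LENGTH(CAST({q_col} AS TEXT))) FROM {q_tab}"
        ).fetchone()[0]
        # Only non-null values are kept as examples.
        cur = conn.execute(
            f"SELECT {q_col} FROM {q_tab} WHERE {q_col} IS NOT NULL LIMIT ?",
            (samples_per_column,),
        )
        for (value,) in cur:
            rows.append((col, max_len, str(value)))
    return rows


def write_set(rows, path):
    # Persist the rows as a CSV file (hypothetical layout for File 1 / File 2).
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["column_name", "max_column_length", "value"])
        writer.writerows(rows)
```

The same routine could serve for the test set in Step 2 by pointing it at a different selection of tables and a different output file.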
  9. Step 2 – Test Set Creation
     • A parser program runs on the schema tables, creates a test set with non-null values, and stores it in the test set file.
     • Fields extracted: column name, max column length and non-null value.
     Flow: Database → Parser program → Test set with non-null values → File 2 (Test_Set)
  11. Step 3 – Fine Tuning Algorithm
     • The algorithm runs iteratively on the data in the training set file, and iterations are repeated until 100% accuracy is achieved.
     Flow (input: File 1, Training_Set):
     • Run algorithm on training set
     • Validate accuracy of algorithm
     • Is accuracy = 100%?
       – Yes: freeze algorithm for test set
       – No: fine tune algorithm (adjust number of rows, columns, neighbors, etc.) and run again
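The tuning loop in Step 3 can be illustrated with a small sketch. The slide lists "neighbors" among the tunable settings, so this example assumes a k-nearest-neighbour classifier and tunes only `k`; the classifier, the leave-one-out scoring and the names `knn_predict`, `leave_one_out_accuracy` and `tune` are assumptions for illustration, not the authors' implementation. The candidate list bounds the number of iterations, so the loop still terminates if the target accuracy is never reached.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Majority label among the k training rows nearest to query."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(rows, k):
    """Score each row using all the other rows as training data."""
    hits = 0
    for i, (features, label) in enumerate(rows):
        rest = rows[:i] + rows[i + 1:]
        hits += knn_predict(rest, features, k) == label
    return hits / len(rows)


def tune(rows, candidate_ks, target=1.0):
    """Try each k in turn; stop at the first that meets the target.

    Returns the best (k, accuracy) seen, which is the setting to freeze.
    """
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        acc = leave_one_out_accuracy(rows, k)
        if acc > best_acc:
            best_k, best_acc = k, acc
        if acc >= target:
            break  # accuracy target reached: freeze this setting
    return best_k, best_acc
```

Each row here is a `(features, label)` pair, where `features` is a tuple of numbers such as the max column length and the share of digits in the sampled values (both hypothetical feature choices).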
  12. Step 4 – Validation
     • The algorithm runs on the test set file, compares its predictions with those from the training set file, and stores the result in another file (Recommendations_File).
     Flow: File 2 (Test_Set) → Run frozen algorithm on test sets → Validate output of algorithm → Persist output to File 3 (Recommendations from Algorithm)
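A minimal sketch of the validation step, assuming the frozen algorithm is available as a plain Python callable. The names `write_recommendations` and `agreement`, the CSV layout and the label strings are hypothetical; the comparison against a manual analysis follows the accuracy check described in the case study.

```python
import csv


def write_recommendations(predict, test_rows, path):
    """Run the frozen predictor on each test row and persist the output (File 3)."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["column_name", "recommendation"])
        for column_name, features in test_rows:
            writer.writerow([column_name, predict(features)])


def agreement(recommendations_path, manual_labels):
    """Share of columns where the recommendation matches the manual analysis.

    Columns without a manual label are ignored.
    """
    with open(recommendations_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    scored = [r for r in rows if r["column_name"] in manual_labels]
    if not scored:
        return 0.0
    hits = sum(r["recommendation"] == manual_labels[r["column_name"]] for r in scored)
    return hits / len(scored)
```

Writing the recommendations to a file before scoring them keeps the output human readable and lets the manual comparison run separately from the algorithm.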
  14. Case Study
  15. Context
     Objective: Sensitive data discovery. Identify the sensitive fields in a target database containing both sensitive and non-sensitive information.
     Scope:
     • The target database has around 800 tables.
     • Implementation of an algorithm-based machine learning solution for identifying sensitive fields.
     • The output of the machine learning algorithm needs to be compared with the manual analysis to arrive at the accuracy.
     • The ML PoC will provide a human-readable output.
     Key Considerations:
     • Usage of Java to implement the machine learning algorithm
     • Reduction in time consumption and human effort
     • Better accuracy in identification of sensitive fields
     • The machine learning algorithm will not be written from scratch. Instead, it will be acquired from an open source platform.
  16. Overall Benefits
     • Algorithm Accuracy
       – The machine learning algorithm trains efficiently and reaches ~96% accuracy on the datasets when executed in the test environment.
       – Percentage of training data: 15-20%
       – Percentage of test data: 80-85%
       – Algorithm used: Naïve Bayes
     • Performance
       – We also conducted a performance evaluation of the run time of the algorithms.
       – The timings for code execution on some of the test data sets are below:

     Table size   Training set creation   Training set run   Test set creation   Algorithm run   Total time
     5598 MB      59 s                    20 s               57 s                50 s            3 m 6 s
     600 MB       45 s                    14 s               45 s                38 s            2 m 22 s
     22 MB        32 s                    11 s               31 s                29 s            1 m 43 s
     1732 MB      49 s                    17 s               48 s                41 s            2 m 35 s
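For readers unfamiliar with the algorithm named above, here is a small hand-written Naive Bayes classifier over categorical column attributes. It is only an illustration: the case study states that the algorithm was taken from an open source platform (not named in the slides), and the feature encoding used in the usage example (column name, length bucket, value shape) is a hypothetical choice.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
        self.values = defaultdict(set)              # feature index -> values seen in training
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.feature_counts[(label, i)][value] += 1
                self.values[i].add(value)
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            # Work in log space: log prior plus the sum of log likelihoods.
            score = math.log(count / total)
            for i, value in enumerate(row):
                seen = self.feature_counts[(label, i)][value]
                # Add-one smoothing; the extra +1 in the denominator reserves
                # a bucket for values never seen in training.
                score += math.log((seen + 1) / (count + len(self.values[i]) + 1))
            scores[label] = score
        return max(scores, key=scores.get)
```

A usage example with made-up training rows:

```python
rows = [
    ("ssn", "short", "digits"),
    ("phone", "short", "digits"),
    ("email", "medium", "mixed"),
    ("city", "medium", "alpha"),
    ("status", "short", "alpha"),
    ("notes", "long", "alpha"),
]
labels = ["sensitive"] * 3 + ["non-sensitive"] * 3
model = NaiveBayes().fit(rows, labels)
print(model.predict(("mobile", "short", "digits")))
```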
  17. ML Based Data Dispensation – Features and Scalability
     • Simple plug & play – for any kind of database(s) and files
     • Custom data types – creating customized search patterns
     • Accuracy and reporting – automated custom reporting
     • Easy to use – implementation of a front-end based solution
     • Low-cost solution
     • Easily maintainable and scalable
     • Lower tool configuration effort – conventional TDM tools need a pattern configured for each new sensitive field type. An ML based solution learns on its own, which makes it a more efficient approach to sensitive data discovery.
     • Continuous improvement – the output of the ML based solution will improve over time. The conventional approach does not provide any such benefit.
  18. Thank You