Assurance Scoring: Using Machine Learning and Analytics to Reduce Risk in the Public Sector
Matt Thomson
Natalia Angarita-Jaimes
8/5/2016
Copyright © Capgemini 2012. All Rights Reserved
Outline
Introduction
Traditional Fraud Detection
Assurance Scoring
Machine Learning
Business Rules
Anomaly Detection
Graph Links
Who are we?
Matt Thomson
• Senior Data Scientist at Capgemini
• PhD in Astrophysics (http://arxiv.org/abs/1010.3315)
• Several years' experience in fraud detection
Natalia Angarita-Jaimes
• Data Scientist at Capgemini
• PhD in Optical Engineering
• Several years' experience in signal and image processing
Capgemini
• Big Data Analytics team
• 30 Data Scientists, 40 Big Data Engineers
• Focus on Open Source and Big Data technologies to solve client problems
• Sponsors of the conference!
Introduction to the Problem
The public sector constantly works in an environment of reduced resources
It wants to provide a better service, but with greater efficiency
It is therefore very important that limited resources are focussed correctly
Assurance Scoring
• Use ML and other analytical methods to identify the least risky people or applications, so that investigators' resources can be targeted at the most risky
Hypothetical Example – 2016 Olympics tickets
We are running the application process for selling tickets to the 2016 Olympics
We want to avoid selling tickets to touts/resellers
• The vast majority of people applying for tickets are genuine
• This is a fraud detection problem with a large class imbalance (<0.1%)
• Avoid the approach of investigating every person who applies
Let's say we know from the 2012 Olympics which people ended up reselling their tickets: this is the training data
Use ML to identify the least risky 30% (say) of people wanting tickets
Investigators then focus on the remaining, higher-risk applicants
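The triage step in this example can be sketched as follows. This is a minimal illustration, assuming each applicant already has a risk score from some trained model; the function name `split_by_risk`, the applicant IDs and the scores are all hypothetical, and only the Python standard library is used.

```python
def split_by_risk(scores, low_risk_fraction=0.30):
    """Split applicants into a low-risk group and a group for investigators.

    scores: dict mapping applicant id -> risk score (higher = riskier).
    low_risk_fraction: share of applicants to clear as least risky.
    """
    if not 0.0 <= low_risk_fraction <= 1.0:
        raise ValueError("low_risk_fraction must be between 0 and 1")
    # Rank applicants from least to most risky.
    ranked = sorted(scores, key=scores.get)
    n_low = int(len(ranked) * low_risk_fraction)
    return ranked[:n_low], ranked[n_low:]


# Hypothetical scores for ten applicants (not real data).
example_scores = {
    "A01": 0.02, "A02": 0.91, "A03": 0.10, "A04": 0.35, "A05": 0.05,
    "A06": 0.60, "A07": 0.15, "A08": 0.44, "A09": 0.08, "A10": 0.27,
}
low_risk, to_investigate = split_by_risk(example_scores)
print("Cleared as low risk:", low_risk)
print("Passed to investigators:", len(to_investigate))
```

The 30% cut-off is a parameter rather than a property of the model, so it can be moved as investigator capacity changes.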
Traditional Fraud Detection
Identify Historical Training Data → Feature Engineering → Model Training and Evaluation → Model Execution → Feedback
Assurance Scoring
Focus on low-risk
Allows resources to be better focussed
Not limited to Machine Learning
Built using Python!
• Pandas, Scikit-learn, etc.
POLE ‘Analytical’ Data Layer
Disparate data sources: Atomic Layer
Atomic data is transformed and loaded into POLE
POLE Layer: Person, Object, Location, Event
POLE contains ALL entities from the Atomic Layer, plus their inter-linkages
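One way to picture a POLE layer is as typed entities plus explicit links between them. The sketch below is an assumption about how such a layer could be represented in memory. The class names, fields and sample records are hypothetical and are not taken from the implementation described in the talk.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A POLE entity: kind is one of Person, Object, Location, Event."""
    kind: str
    key: str


@dataclass
class PoleLayer:
    """Holds entities and undirected links between them."""
    entities: set = field(default_factory=set)
    links: set = field(default_factory=set)

    def add_link(self, a, b, relation):
        # Register both entities and store the link in a canonical order,
        # so (a, b) and (b, a) are treated as the same link.
        self.entities.update((a, b))
        first, second = sorted((a, b), key=lambda e: (e.kind, e.key))
        self.links.add((first, second, relation))

    def neighbours(self, entity):
        # Every entity directly linked to the given one.
        found = set()
        for first, second, _ in self.links:
            if first == entity:
                found.add(second)
            elif second == entity:
                found.add(first)
        return found


# Hypothetical records drawn from two atomic sources.
person = Entity("Person", "applicant-123")
address = Entity("Location", "postcode-AB1 2CD")
application = Entity("Event", "ticket-application-9")

layer = PoleLayer()
layer.add_link(person, application, "submitted")
layer.add_link(application, address, "delivery_address")
print(sorted(e.key for e in layer.neighbours(application)))
```

Keeping links as first-class records is what lets later steps, such as graph-based matching, walk from one entity to related ones.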
Machine learning
Framework: structure, flexibility, consistency
• Input data; vector build; manipulate and explore data
• Feature extraction and selection: transform, selection
• Model building: model training, validation, test
• Variety of output files: logs, graphics, pickled models, etc.
• Testing: unit tests, monitoring tests and integration tests
Machine Learning: Feature Engineering
Tools: SQL, Python
Input: historical data
Cycle: Transform → Explore → Select; ask questions, validate, refine features
• Feature extraction
• Data exploration
• Feature selection
Machine Learning: Model Building
Selected features → Split datasets (training, validation, test) → Build models → Hyper-parameter tuning → Models
Training results, validation results and test results → Compare models
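The dataset split on this slide can be sketched with the standard library alone. The function below is an illustrative assumption rather than the framework's own code. It shuffles each class separately so that a rare high-risk class appears in all three sets, which matters when the classes are heavily imbalanced. The function name, split fractions and toy labels are hypothetical.

```python
import random


def stratified_split(labels, fractions=(0.6, 0.2), seed=0):
    """Return (train, validation, test) lists of row indices.

    labels: sequence of class labels, one per row.
    fractions: shares for training and validation; the rest goes to test.
    """
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, validation, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train.extend(indices[:n_train])
        validation.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])
    return train, validation, test


# Toy labels: 1 = resold tickets, 0 = genuine (hypothetical data).
toy_labels = [0] * 95 + [1] * 5
train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Hyper-parameters would then be tuned against the validation indices only, with the test indices held back for the final model comparison.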
Low risk? High risk? Depends on the classifier's threshold
• True positives: applications the model correctly classifies as high risk
• True negatives: applications the model correctly classifies as low risk
• False positives: applications the model scores as high risk but which are in fact low risk
• False negatives: applications the model scores as low risk but which are in fact high risk
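How the four counts move with the threshold can be shown on a toy example. This is a sketch under stated assumptions: the scores and outcomes below are made up, the function name `confusion_counts` is hypothetical, and a score at or above the threshold is treated as "high risk".

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, TN, FP, FN when scores >= threshold are flagged high risk.

    labels: 1 = actually high risk, 0 = actually low risk.
    """
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}


# Hypothetical model scores and true outcomes for eight applications.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.65, 0.80, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 1]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, confusion_counts(scores, labels, threshold))
```

Lowering the threshold trades false negatives for false positives. Since assurance scoring clears the low-risk bucket without investigation, a false negative is usually the more costly error.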
Business Rules
Identifying fraud has often been done using deterministic rules
For example, look for transactions near a threshold or at the end of the day
Rules are primarily data queries on your feature vector
Olympics example: anyone applying for more than £10,000 of tickets
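Deterministic rules of this kind can be written as named predicates over an application record. The sketch below is illustrative only: the field names, the "near the limit" band and the 23:00 cut-off are assumptions, and only the £10,000 figure comes from the slide.

```python
from datetime import datetime

# Each rule is a name plus a predicate over one application record.
# Apart from the 10,000 GBP limit, the thresholds are illustrative assumptions.
RULES = [
    ("high_value", lambda app: app["total_value_gbp"] > 10_000),
    ("near_limit", lambda app: 9_500 <= app["total_value_gbp"] <= 10_000),
    ("end_of_day", lambda app: app["submitted_at"].hour >= 23),
]


def triggered_rules(application):
    """Return the names of all rules that fire for this application."""
    return [name for name, predicate in RULES if predicate(application)]


# Hypothetical applications.
applications = [
    {"id": "A01", "total_value_gbp": 120,
     "submitted_at": datetime(2016, 5, 8, 14, 30)},
    {"id": "A02", "total_value_gbp": 12_500,
     "submitted_at": datetime(2016, 5, 8, 23, 55)},
    {"id": "A03", "total_value_gbp": 9_900,
     "submitted_at": datetime(2016, 5, 8, 9, 5)},
]

for app in applications:
    print(app["id"], triggered_rules(app))
```

Returning the rule names, rather than a single flag, keeps the reason for each referral visible to the investigator.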
Anomaly Detection
Use the training data to create a baseline of applications by postcode (say)
If a particular postcode has a larger than expected number of applications, those cases are pushed into the high-risk bucket
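A postcode baseline check might look like the sketch below. The slide does not say how "larger than expected" is decided, so the Poisson-style margin used here (expected count plus a multiple of its square root) is an assumption, as are the function name and the sample postcodes.

```python
from collections import Counter
from math import sqrt


def flag_postcodes(baseline_postcodes, current_postcodes, n_sigma=3.0):
    """Flag postcodes with more applications than the baseline suggests.

    The expected count is the postcode's historical share scaled to the
    current volume. A postcode is flagged when its observed count exceeds
    expected + n_sigma * sqrt(expected). Postcodes absent from the baseline
    get a small floor so a burst from a new postcode can still be flagged.
    """
    if not baseline_postcodes:
        raise ValueError("baseline must not be empty")
    baseline = Counter(baseline_postcodes)
    current = Counter(current_postcodes)
    scale = len(current_postcodes) / len(baseline_postcodes)

    flagged = {}
    for postcode, observed in current.items():
        expected = max(baseline[postcode] * scale, 0.5)
        if observed > expected + n_sigma * sqrt(expected):
            flagged[postcode] = (observed, expected)
    return flagged


# Hypothetical data: three postcodes with a stable historical mix.
baseline = ["AB1"] * 500 + ["CD2"] * 300 + ["EF3"] * 200
current = ["AB1"] * 490 + ["CD2"] * 290 + ["EF3"] * 320

print(flag_postcodes(baseline, current))
```

Applications from a flagged postcode would then be moved to the high-risk bucket regardless of their model score.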
Graph Links - Matching
Key part of assurance scoring: bringing data together from disparate sources
Attribute       | Data Source 1 | Data Source 2
Name            | Matt Thomson  | Matthew Thosmon
Phone Number    | 07123 456 789 | 07123 456 798
Favourite Sport | Football      | Cricket
Probability of Match: 80%
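Fuzzy matching of the two records in this example can be approximated with the standard library. The sketch below is not the method behind the 80% figure on the slide, which is not described. It substitutes a weighted average of `difflib.SequenceMatcher` ratios, and the weights and function names are assumptions. The result is a heuristic similarity score, not a calibrated probability.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] from difflib's ratio()."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(record_a, record_b, weights):
    """Weighted average of per-attribute similarities.

    This is a heuristic score, not a calibrated probability.
    """
    total = sum(weights.values())
    return sum(
        weight * similarity(record_a[attr], record_b[attr])
        for attr, weight in weights.items()
    ) / total


source_1 = {"name": "Matt Thomson", "phone": "07123 456 789", "sport": "Football"}
source_2 = {"name": "Matthew Thosmon", "phone": "07123 456 798", "sport": "Cricket"}

# Assumed weights: identifying attributes count for more than preferences.
weights = {"name": 0.45, "phone": 0.45, "sport": 0.10}

score = match_score(source_1, source_2, weights)
print(round(score, 2))
```

Pairs scoring above a chosen cut-off would be linked as the same person, which is what lets records from separate sources share one entity.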
Jupyter Notebook
Further Details
matt.thomson@capgemini.com / @MattGThomson
Assurance Scoring brochure: http://ow.ly/4nbEUI
Blogs:
• Introduction: https://www.capgemini.com/node/1380596
• Integrating multiple techniques: http://bit.ly/24BmszV
• Machine Learning: http://bit.ly/1QTMGnq
• More coming soon!
We’re Hiring!
Data Science
https://www.uk.capgemini.com/careers/jobs/data-scientist-0
Big Data Engineer
https://www.uk.capgemini.com/careers/jobs/big-data-engineer
Data Visualisation Analyst
https://www.uk.capgemini.com/careers/jobs/data-visualisation-analyst
matt.thomson@capgemini.com
The information contained in this presentation is proprietary.
© 2012 Capgemini. All rights reserved.
www.capgemini.com
About Capgemini
With more than 120,000 people in 40 countries, Capgemini is one
of the world's foremost providers of consulting, technology and
outsourcing services. The Group reported 2011 global revenues
of EUR 9.7 billion.
Together with its clients, Capgemini creates and delivers
business and technology solutions that fit their needs and drive
the results they want. A deeply multicultural organization,
Capgemini has developed its own way of working, the
Collaborative Business Experience™, and draws on Rightshore®,
its worldwide delivery model.
Rightshore® is a trademark belonging to Capgemini

Assurance Scoring Pydata London 2016
