Big Data Medicare Fraud
Detection
ATHARVA KOUSADIKAR
Why?
US healthcare spending has
increased by 6.7%, bringing it to
$3 trillion.
Medicare accounts for up to
$800 bn.
The impact of fraud is estimated at up to
10%.
Workflow
Dataset Selection
01
● CMS Prescriber Data 2017
● Payment Data 2017
● Excluded (LEIE) dataset
Data Pre-processing
02
● Data Visualization / Exploratory Data Analysis
● Data cleaning
● Feature Engineering
● Class weight balancing
Data Modelling
03
● Logistic Regression
● Gaussian Naïve Bayes
● Random Forest Classifier
● Extra Trees Classifier
● Gradient Boosting Classifier
End Result
04
● Conclusion
● Future scope
Problem
Statement
Build an innovative machine
learning model that predicts fraud
in the Medicare industry using
anomaly analysis and
geo-demographic metrics.
Fraud
Patterns
1. Fraud by service providers (doctors, hospitals, pharmacies)
2. Fraud by insurance subscribers (patients or patients' employers)
3. Fraud by insurance carriers
4. Conspiracy fraud (involving all parties)
Govt.
Efforts
The government has initiated
programs such as the
Medicare Fraud Strike Force
to help combat fraud,
but continued efforts are
needed to better mitigate the
effects of fraud.
Insights
Tools Used:
1. Tableau
2. Power BI
3. Spark on Azure HDInsight
Population by State
NPIs per State
Exclusion Count
Number of Fraud Cases by State
Dataset Selection
CMS – Prescriber Data 2017
01
● 25M+ rows and 21 columns
● All information related to prescription, drugs,
payments and charges by National Provider
Identifier (NPI).
● All information on the physician (NPI, Name, City,
Practice, etc.)
Payments Received by
Physicians 2017
02
● 11M+ rows and 75 columns
● Physicians in the US are required to declare all
payments received from pharmaceutical companies
● Sum of general payments
● Name of the drug associated with the payments
List of Excluded Individuals
and Entities (LEIE) database
2017
03
● List of individuals and entities that are excluded from
participating in federally funded healthcare
programs (e.g. Medicare) due to previous healthcare
fraud.
● Used to map fraud labels
Data Pre-Processing
Data cleaning
● Imputing missing data
● Removing duplicates
● Removing outliers
● Encoding categorical data as factors
● Removing data based on general information
● Data sampling: the dataset is highly imbalanced for fraud detection
(about 99% non-fraudulent cases and less than 1% fraudulent cases)
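The cleaning steps listed above can be illustrated with a small, library-free sketch. This is a toy stand-in under stated assumptions, not the deck's actual code: the field names (`npi`, `specialty`, `total_cost`) and the choice of median imputation and integer category codes are hypothetical, and outlier removal is left out because the deck does not say which rule was used.

```python
from statistics import median


def clean_records(rows, numeric_key, categorical_key):
    """Deduplicate rows, impute a numeric column, encode a categorical column.

    Assumptions (not from the deck): missing values are None, imputation
    uses the median, and categories become integer codes.
    """
    # 1. Remove exact duplicate rows while preserving order.
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    # 2. Impute missing numeric values with the median of observed values.
    observed = [r[numeric_key] for r in unique if r[numeric_key] is not None]
    fill = median(observed)
    for r in unique:
        if r[numeric_key] is None:
            r[numeric_key] = fill

    # 3. Encode categories as integer codes (sorted for determinism).
    categories = sorted({r[categorical_key] for r in unique})
    codes = {c: i for i, c in enumerate(categories)}
    for r in unique:
        r[categorical_key + "_code"] = codes[r[categorical_key]]
    return unique, codes


# Hypothetical example rows.
rows = [
    {"npi": 1, "specialty": "Cardiology", "total_cost": 100.0},
    {"npi": 1, "specialty": "Cardiology", "total_cost": 100.0},  # duplicate
    {"npi": 2, "specialty": "Oncology", "total_cost": None},     # missing
    {"npi": 3, "specialty": "Cardiology", "total_cost": 300.0},
]
cleaned, codes = clean_records(rows, "total_cost", "specialty")
```

On datasets of the size quoted in this deck, the same steps would normally be run in a dataframe or Spark job rather than in plain Python lists.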
Feature Engineering
Joining datasets based on NPI, state, city, first and last name
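A minimal sketch of how fraud labels might be mapped onto prescriber records using the join keys named above. The matching rule is an assumption: a prescriber is flagged if its NPI appears in the exclusion list, or, when the exclusion record carries no usable NPI, if last name, first name, city and state all match after normalisation. All field names and records are hypothetical.

```python
def _name_key(record):
    """Normalised (last, first, city, state) tuple used as a fallback join key."""
    fields = ("last_name", "first_name", "city", "state")
    return tuple(str(record[f]).strip().upper() for f in fields)


def map_fraud_labels(prescribers, exclusions):
    """Return prescriber records with an added binary 'fraud' label."""
    # Exclusion records with a missing or zero NPI contribute no NPI key.
    excluded_npis = {e["npi"] for e in exclusions if e.get("npi")}
    excluded_names = {_name_key(e) for e in exclusions}
    labeled = []
    for p in prescribers:
        hit = p["npi"] in excluded_npis or _name_key(p) in excluded_names
        labeled.append({**p, "fraud": int(hit)})
    return labeled


# Hypothetical records for illustration only.
prescribers = [
    {"npi": 111, "last_name": "Doe", "first_name": "Jane", "city": "Austin", "state": "TX"},
    {"npi": 222, "last_name": "Roe", "first_name": "Sam", "city": "Miami", "state": "FL"},
    {"npi": 333, "last_name": "Poe", "first_name": "Ann", "city": "Boise", "state": "ID"},
]
exclusions = [
    {"npi": 111, "last_name": "DOE", "first_name": "JANE", "city": "AUSTIN", "state": "TX"},
    {"npi": 0, "last_name": "roe", "first_name": "sam ", "city": "Miami", "state": "FL"},
]
labeled = map_fraud_labels(prescribers, exclusions)
```

Name-based matching can produce false positives for common names, so a real pipeline would likely need a stricter rule or manual review.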
Drug-based fraudulent cases
Merging drug-related fraudulent cases with
prescriber data to create more features
Transforming Data and Class Balancing
Log transformation applied so that skewed data approximately conforms to normality
Class weights assigned according to the
balancing ratio to offset the
class imbalance
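Both steps on this slide can be sketched in a few lines. The deck does not state which log variant or weighting formula was used, so this sketch assumes `log(1 + x)` (which tolerates zeros) and the common "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is the same formula scikit-learn applies for `class_weight="balanced"`.

```python
import math
from collections import Counter


def log_transform(values):
    """Compress a right-skewed, non-negative feature with log(1 + x)."""
    return [math.log1p(v) for v in values]


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    weight[c] = n_samples / (n_classes * count[c])
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# A 99:1 split similar to the imbalance described in the deck.
labels = [0] * 99 + [1]
weights = balanced_class_weights(labels)
```

With a 99:1 split, the rare class receives a weight 99 times larger than the majority class, so each fraud case counts as much in the loss as 99 non-fraud cases.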
Data Modelling
Train-Test-Split
Scaling data using StandardScaler
Models Implemented:
• Logistic Regression
• Gaussian Naïve Bayes
• Random Forest Classifier
• ExtraTrees Classifier
• Gradient Boosting Classifier
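The split-then-scale step can be sketched without any library, as below. This is a plain-Python stand-in for a train-test split followed by standard scaling, not the deck's code: the stratification, the 25% test fraction and the seed are assumptions. The point it illustrates is that the scaling statistics are fitted on the training rows only and then reused for the test rows.

```python
import random
from statistics import mean, pstdev


def stratified_split(y, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def fit_standardizer(rows):
    """Per-column mean and population standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return mus, sigmas


def standardize(rows, mus, sigmas):
    """Apply (x - mean) / std column-wise using previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(r, mus, sigmas)] for r in rows]


# Hypothetical toy data: 16 non-fraud rows, 4 fraud rows, 2 features.
X = [[float(i), 2.0 * i] for i in range(20)]
y = [0] * 16 + [1] * 4
train_idx, test_idx = stratified_split(y)
mus, sigmas = fit_standardizer([X[i] for i in train_idx])
X_train = standardize([X[i] for i in train_idx], mus, sigmas)
X_test = standardize([X[i] for i in test_idx], mus, sigmas)
```

Fitting the scaler on the full dataset before splitting would leak test-set statistics into training, which is why the fit and apply steps are kept separate here.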
Model Evaluation
Random Classifier
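For reference, the AUC metric reported in this deck can be computed directly from its definition: the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case, with ties counted as one half. The sketch below is a simple O(P x N) implementation for illustration, with made-up scores. A scorer that cannot separate the classes at all lands at 0.5, which is the usual "random classifier" baseline.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for illustration.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)               # 3 of 4 pairs ranked correctly
baseline = roc_auc(y_true, [0.5] * 4)       # uninformative scorer
```

Because AUC depends only on the ranking of scores, it is less misleading than accuracy on a dataset where about 99% of cases are non-fraudulent.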
Conclusion
● With the growing population aged over 65 in the USA, Medicare fraud detection
is essential
● All types of fraud patterns have been covered.
● Most fraud cases were committed in the Bay Area
● Of the 5 models evaluated, the best-performing model is Random Forest, with an AUC of 72%
Future Scope
• Use cross-validation to split the data into training and test
sets.
• Tune hyper-parameters to improve the overall performance
of the algorithm.
• Build a real-time fraud detection pipeline using MLflow and
Kafka.
• Retrain the model without stopping the
prediction service, since users will keep interacting with it.
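The cross-validation item above could look roughly like the following sketch. It assumes a stratified k-fold scheme, chosen here because with fewer than 1% fraud cases an unstratified fold may contain no positives at all. Indices are assigned round-robin without shuffling to keep the example deterministic; the function name and the value of k are hypothetical.

```python
def stratified_kfold(y, k=5):
    """Yield (train_idx, test_idx) pairs with class ratios preserved per fold."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    # Deal each class's indices across the folds in turn.
    for idx in by_class.values():
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


# Toy labels: 95 non-fraud, 5 fraud.
y = [0] * 95 + [1] * 5
splits = list(stratified_kfold(y, k=5))
```

Each candidate hyper-parameter setting can then be scored on every fold and the mean score compared across settings, which gives a steadier estimate than a single train-test split.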
Random Forest model hosted using MLflow
Kafka and ZooKeeper servers initialized
using Docker
References
● Part D Prescriber Data CY 2017. (n.d.). Retrieved June 23, 2020, from
https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/PartD2017
● LEIE Downloadable Databases: Office of Inspector General: U.S. Department of Health
and Human Services. (2020, June 10). Retrieved June 23, 2020, from
https://oig.hhs.gov/exclusions/exclusions_list.asp
● Dataset Downloads. (n.d.). Retrieved June 23, 2020, from
https://www.cms.gov/OpenPayments/Explore-the-Data/Dataset-Downloads
Thank you
