SlideShare a Scribd company logo
Big Data Project
Medicare Fraud Detection
Using Open Source Data
Xinyu (Max) Liu
xinyulrsm@gmail.com
Project link:
http://nbviewer.jupyter.org/github/maxliu/health_care/blob/master
/HealthCare_fraud_detection.ipynb
Project overview
• Goal: Build a predictive model for detecting health care fraud
• Data source
• Part-D and Payment data: cms.gov
• NPI exclusion data: hhs.gov
• Technology
• Hadoop/Yarn/Spark/Hive: Part_D and Payment data are stored in a five-node cluster.
• Pig: Pig scripts were written for cleaning data and merging tables.
• Python/SciKit-learn: Implemented a data processing pipeline including scaling and
modeling. t-test is used to select drugs on training set for feature extraction.
• Algorithm and result
• Five different models: Logistic regression, Gaussian, Random Forest Classifier, Extra
Trees Classifier, Gradient Boosting Classifier
• Average AUC: 0.69 (Random Forest) 2
Fraudulence in health care industry
3
Year People Charged False Billings (Million)
2010 94 $251
2011 91 $295
2012 2 $1.9
2013 89 $223
2014
2015 243 $712
Data source: en.Wikipedia.org/wiki/Medicare_fraud
Date storage: 5-node Hadoop/yarn/spark cluster
4
Node 1
2 Core
4GB Memory
Node 3
2 Core
4GB Memory
Node 3
1 Core
4GB Memory
Node 4
2 Core
8GB Memory
Node 5
4 Core
22GB Memory
Copy files to Hadoop then move to Hive warehouse
5
Pig scripts for cleaning data and merging tables
6
7
Group part_D data by NPI and drug name for future analysis
8
Group Payment data by NPI for future analysis
NPI exclusions database used as target
9
Merged exclusion data to the table as “is_fraud” column
The data is highly unbalanced.
• Only 427 positive samples in total 808076 records (0.05%).
• Possible solutions
• Down sampling
• Up sampling
• Give higher weight topositive samples for model training ( used in this
project)
10
t-test: Select the drug feature only when P <= 0.05
11
No YES
Note that this feature selection was only applied on training set to avoid data leakage.
NoAn example for
“CLOTRIMAZOLE”
Five different models used:
12
Logistic regression, Gaussian, Random Forest Classifier, Extra
Trees Classifier, Gradient Boosting Classifier
Feature importance analysis (from Random Forest)
Findings
• The “drug_score” feature created
become the most important feature
• The payment turned out to be
trivial or negligible.
13
AUC (from Random Forest)
14
• Average AUC ~ 0.685 (> 0.5)
• This means the predictive model I
built can provide useful information
on health care fraud detection.
Future work for improvement
● Include more data (e.g., the Part-B dataset)
● Add additional feature (e.g., Page rank)
● Blend multiple model
15
References
● https://www.versustexas.com/criminal/healthcare-medicare-medicaid-fraud/
PageRank for Anomaly Detection,
By Ofer Mendelevitch and Jiwon Seo
● http://www.dataiku.com/blog/2015/08/12/Medicare_Fraud.html
By Pierre Gutierrez @ Dataiku
16

More Related Content

What's hot

What is Data analytics and it's importance ?
What is Data analytics and it's importance ?What is Data analytics and it's importance ?
What is Data analytics and it's importance ?
AbhayDhupar
 
Power Analysis for Beginners
Power Analysis for BeginnersPower Analysis for Beginners
Power Analysis for Beginnersgfb1
 
Constructing Query-Driven Dynamic Machine Learning Model With Application to ...
Constructing Query-Driven Dynamic Machine Learning Model With Application to ...Constructing Query-Driven Dynamic Machine Learning Model With Application to ...
Constructing Query-Driven Dynamic Machine Learning Model With Application to ...
I3E Technologies
 
Comparative Study of Classification Method on Customer Candidate Data to Pred...
Comparative Study of Classification Method on Customer Candidate Data to Pred...Comparative Study of Classification Method on Customer Candidate Data to Pred...
Comparative Study of Classification Method on Customer Candidate Data to Pred...
IJECEIAES
 
SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) p...
SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) p...SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) p...
SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) p...
Paul Richards
 
File 498 Doc 4 01 Dm Intro To Dm
File 498 Doc 4 01 Dm Intro To DmFile 498 Doc 4 01 Dm Intro To Dm
File 498 Doc 4 01 Dm Intro To Dmmupa
 
4 visualization inter
4 visualization inter4 visualization inter
4 visualization inter
FEG
 
Understanding Statistical Power for Non-Statisticians
Understanding Statistical Power for Non-StatisticiansUnderstanding Statistical Power for Non-Statisticians
Understanding Statistical Power for Non-Statisticians
Statistics & Data Corporation
 
The XNAT imaging informatics platform
The XNAT imaging informatics platformThe XNAT imaging informatics platform
The XNAT imaging informatics platform
imgcommcall
 
Imputation techniques for missing data in clinical trials
Imputation techniques for missing data in clinical trialsImputation techniques for missing data in clinical trials
Imputation techniques for missing data in clinical trials
Nitin George
 
Introduction to Modeling
Introduction to ModelingIntroduction to Modeling
Introduction to Modeling
JMP software from SAS
 
RapidMiner, an entrance to explore MIMIC-III?
RapidMiner, an entrance to explore MIMIC-III?RapidMiner, an entrance to explore MIMIC-III?
RapidMiner, an entrance to explore MIMIC-III?
Sven Van Poucke, MD, PhD
 

What's hot (12)

What is Data analytics and it's importance ?
What is Data analytics and it's importance ?What is Data analytics and it's importance ?
What is Data analytics and it's importance ?
 
Power Analysis for Beginners
Power Analysis for BeginnersPower Analysis for Beginners
Power Analysis for Beginners
 
Constructing Query-Driven Dynamic Machine Learning Model With Application to ...
Constructing Query-Driven Dynamic Machine Learning Model With Application to ...Constructing Query-Driven Dynamic Machine Learning Model With Application to ...
Constructing Query-Driven Dynamic Machine Learning Model With Application to ...
 
Comparative Study of Classification Method on Customer Candidate Data to Pred...
Comparative Study of Classification Method on Customer Candidate Data to Pred...Comparative Study of Classification Method on Customer Candidate Data to Pred...
Comparative Study of Classification Method on Customer Candidate Data to Pred...
 
SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) p...
SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) p...SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) p...
SheffieldR July Meeting - Multiple Imputation with Chained Equations (MICE) p...
 
File 498 Doc 4 01 Dm Intro To Dm
File 498 Doc 4 01 Dm Intro To DmFile 498 Doc 4 01 Dm Intro To Dm
File 498 Doc 4 01 Dm Intro To Dm
 
4 visualization inter
4 visualization inter4 visualization inter
4 visualization inter
 
Understanding Statistical Power for Non-Statisticians
Understanding Statistical Power for Non-StatisticiansUnderstanding Statistical Power for Non-Statisticians
Understanding Statistical Power for Non-Statisticians
 
The XNAT imaging informatics platform
The XNAT imaging informatics platformThe XNAT imaging informatics platform
The XNAT imaging informatics platform
 
Imputation techniques for missing data in clinical trials
Imputation techniques for missing data in clinical trialsImputation techniques for missing data in clinical trials
Imputation techniques for missing data in clinical trials
 
Introduction to Modeling
Introduction to ModelingIntroduction to Modeling
Introduction to Modeling
 
RapidMiner, an entrance to explore MIMIC-III?
RapidMiner, an entrance to explore MIMIC-III?RapidMiner, an entrance to explore MIMIC-III?
RapidMiner, an entrance to explore MIMIC-III?
 

Similar to Medicare fraud detection

Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
Sri Ambati
 
IRJET- Disease Prediction System
IRJET- Disease Prediction SystemIRJET- Disease Prediction System
IRJET- Disease Prediction System
IRJET Journal
 
IRJET- Analyse Big Data Electronic Health Records Database using Hadoop Cluster
IRJET- Analyse Big Data Electronic Health Records Database using Hadoop ClusterIRJET- Analyse Big Data Electronic Health Records Database using Hadoop Cluster
IRJET- Analyse Big Data Electronic Health Records Database using Hadoop Cluster
IRJET Journal
 
HEALTH PREDICTION ANALYSIS USING DATA MINING
HEALTH PREDICTION ANALYSIS USING DATA  MININGHEALTH PREDICTION ANALYSIS USING DATA  MINING
HEALTH PREDICTION ANALYSIS USING DATA MINING
Ashish Salve
 
IRJET- Medical Data Mining
IRJET- Medical Data MiningIRJET- Medical Data Mining
IRJET- Medical Data Mining
IRJET Journal
 
An efficient feature selection algorithm for health care data analysis
An efficient feature selection algorithm for health care data analysisAn efficient feature selection algorithm for health care data analysis
An efficient feature selection algorithm for health care data analysis
journalBEEI
 
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford MedMachine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
Sri Ambati
 
Hybrid filtering methods for feature selection in high-dimensional cancer data
Hybrid filtering methods for feature selection in high-dimensional cancer dataHybrid filtering methods for feature selection in high-dimensional cancer data
Hybrid filtering methods for feature selection in high-dimensional cancer data
IJECEIAES
 
Disease Prediction And Doctor Appointment system
Disease Prediction And Doctor Appointment  systemDisease Prediction And Doctor Appointment  system
Disease Prediction And Doctor Appointment system
KOYELMAJUMDAR1
 
IRJET- Review on Knowledge Discovery and Analysis in Healthcare using Dat...
IRJET-  	  Review on Knowledge Discovery and Analysis in Healthcare using Dat...IRJET-  	  Review on Knowledge Discovery and Analysis in Healthcare using Dat...
IRJET- Review on Knowledge Discovery and Analysis in Healthcare using Dat...
IRJET Journal
 
Hybrid prediction model with missing value imputation for medical data 2015-g...
Hybrid prediction model with missing value imputation for medical data 2015-g...Hybrid prediction model with missing value imputation for medical data 2015-g...
Hybrid prediction model with missing value imputation for medical data 2015-g...
Jitender Grover
 
ICU Patient Deterioration Prediction : A Data-Mining Approach
ICU Patient Deterioration Prediction : A Data-Mining ApproachICU Patient Deterioration Prediction : A Data-Mining Approach
ICU Patient Deterioration Prediction : A Data-Mining Approach
csandit
 
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACHICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
cscpconf
 
Automatic missing value imputation for cleaning phase of diabetic’s readmissi...
Automatic missing value imputation for cleaning phase of diabetic’s readmissi...Automatic missing value imputation for cleaning phase of diabetic’s readmissi...
Automatic missing value imputation for cleaning phase of diabetic’s readmissi...
IJECEIAES
 
Final_Presentation.pptx
Final_Presentation.pptxFinal_Presentation.pptx
Final_Presentation.pptx
SudeekshaKoricherla
 
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Bigfinite
 
Predicting diabetes using a machine learning approach linked in
Predicting diabetes using a machine learning approach   linked inPredicting diabetes using a machine learning approach   linked in
Predicting diabetes using a machine learning approach linked in
venkatvajradhar1
 
Data Mining in Health Care
Data Mining in Health CareData Mining in Health Care
Data Mining in Health Care
ShahDhruv21
 
MULTI MODEL DATA MINING APPROACH FOR HEART FAILURE PREDICTION
MULTI MODEL DATA MINING APPROACH FOR HEART FAILURE PREDICTIONMULTI MODEL DATA MINING APPROACH FOR HEART FAILURE PREDICTION
MULTI MODEL DATA MINING APPROACH FOR HEART FAILURE PREDICTION
IJDKP
 
IRJET- Hybrid Architecture of Heart Disease Prediction System using Genetic N...
IRJET- Hybrid Architecture of Heart Disease Prediction System using Genetic N...IRJET- Hybrid Architecture of Heart Disease Prediction System using Genetic N...
IRJET- Hybrid Architecture of Heart Disease Prediction System using Genetic N...
IRJET Journal
 

Similar to Medicare fraud detection (20)

Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
IRJET- Disease Prediction System
IRJET- Disease Prediction SystemIRJET- Disease Prediction System
IRJET- Disease Prediction System
 
IRJET- Analyse Big Data Electronic Health Records Database using Hadoop Cluster
IRJET- Analyse Big Data Electronic Health Records Database using Hadoop ClusterIRJET- Analyse Big Data Electronic Health Records Database using Hadoop Cluster
IRJET- Analyse Big Data Electronic Health Records Database using Hadoop Cluster
 
HEALTH PREDICTION ANALYSIS USING DATA MINING
HEALTH PREDICTION ANALYSIS USING DATA  MININGHEALTH PREDICTION ANALYSIS USING DATA  MINING
HEALTH PREDICTION ANALYSIS USING DATA MINING
 
IRJET- Medical Data Mining
IRJET- Medical Data MiningIRJET- Medical Data Mining
IRJET- Medical Data Mining
 
An efficient feature selection algorithm for health care data analysis
An efficient feature selection algorithm for health care data analysisAn efficient feature selection algorithm for health care data analysis
An efficient feature selection algorithm for health care data analysis
 
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford MedMachine Learning in Modern Medicine with Erin LeDell at Stanford Med
Machine Learning in Modern Medicine with Erin LeDell at Stanford Med
 
Hybrid filtering methods for feature selection in high-dimensional cancer data
Hybrid filtering methods for feature selection in high-dimensional cancer dataHybrid filtering methods for feature selection in high-dimensional cancer data
Hybrid filtering methods for feature selection in high-dimensional cancer data
 
Disease Prediction And Doctor Appointment system
Disease Prediction And Doctor Appointment  systemDisease Prediction And Doctor Appointment  system
Disease Prediction And Doctor Appointment system
 
IRJET- Review on Knowledge Discovery and Analysis in Healthcare using Dat...
IRJET-  	  Review on Knowledge Discovery and Analysis in Healthcare using Dat...IRJET-  	  Review on Knowledge Discovery and Analysis in Healthcare using Dat...
IRJET- Review on Knowledge Discovery and Analysis in Healthcare using Dat...
 
Hybrid prediction model with missing value imputation for medical data 2015-g...
Hybrid prediction model with missing value imputation for medical data 2015-g...Hybrid prediction model with missing value imputation for medical data 2015-g...
Hybrid prediction model with missing value imputation for medical data 2015-g...
 
ICU Patient Deterioration Prediction : A Data-Mining Approach
ICU Patient Deterioration Prediction : A Data-Mining ApproachICU Patient Deterioration Prediction : A Data-Mining Approach
ICU Patient Deterioration Prediction : A Data-Mining Approach
 
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACHICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
 
Automatic missing value imputation for cleaning phase of diabetic’s readmissi...
Automatic missing value imputation for cleaning phase of diabetic’s readmissi...Automatic missing value imputation for cleaning phase of diabetic’s readmissi...
Automatic missing value imputation for cleaning phase of diabetic’s readmissi...
 
Final_Presentation.pptx
Final_Presentation.pptxFinal_Presentation.pptx
Final_Presentation.pptx
 
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
 
Predicting diabetes using a machine learning approach linked in
Predicting diabetes using a machine learning approach   linked inPredicting diabetes using a machine learning approach   linked in
Predicting diabetes using a machine learning approach linked in
 
Data Mining in Health Care
Data Mining in Health CareData Mining in Health Care
Data Mining in Health Care
 
MULTI MODEL DATA MINING APPROACH FOR HEART FAILURE PREDICTION
MULTI MODEL DATA MINING APPROACH FOR HEART FAILURE PREDICTIONMULTI MODEL DATA MINING APPROACH FOR HEART FAILURE PREDICTION
MULTI MODEL DATA MINING APPROACH FOR HEART FAILURE PREDICTION
 
IRJET- Hybrid Architecture of Heart Disease Prediction System using Genetic N...
IRJET- Hybrid Architecture of Heart Disease Prediction System using Genetic N...IRJET- Hybrid Architecture of Heart Disease Prediction System using Genetic N...
IRJET- Hybrid Architecture of Heart Disease Prediction System using Genetic N...
 

Recently uploaded

Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 

Recently uploaded (20)

Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 

Medicare fraud detection

  • 1. Big Data Project Medicare Fraud Detection Using Open Source Data Xinyu (Max) Liu xinyulrsm@gmail.com Project link: http://nbviewer.jupyter.org/github/maxliu/health_care/blob/master /HealthCare_fraud_detection.ipynb
  • 2. Project overview • Goal: Build a predictive model for detecting health care fraud • Data source • Part-D and Payment data: cms.gov • NPI exclusion data: hhs.gov • Technology • Hadoop/Yarn/Spark/Hive: Part_D and Payment data are stored in a five-node cluster. • Pig: Pig scripts were written for cleaning data and merging tables. • Python/SciKit-learn: Implemented a data processing pipeline including scaling and modeling. t-test is used to select drugs on training set for feature extraction. • Algorithm and result • Five different models: Logistic regression, Gaussian, Random Forest Classifier, Extra Trees Classifier, Gradient Boosting Classifier • Average AUC: 0.69 (Random Forest) 2
  • 3. Fraudulence in health care industry 3 Year People Charged False Billings (Million) 2010 94 $251 2011 91 $295 2012 2 $1.9 2013 89 $223 2014 2015 243 $712 Data source: en.Wikipedia.org/wiki/Medicare_fraud
  • 4. Date storage: 5-node Hadoop/yarn/spark cluster 4 Node 1 2 Core 4GB Memory Node 3 2 Core 4GB Memory Node 3 1 Core 4GB Memory Node 4 2 Core 8GB Memory Node 5 4 Core 22GB Memory
  • 5. Copy files to Hadoop then move to Hive warehouse 5
  • 6. Pig scripts for cleaning data and merging tables 6
  • 7. 7 Group part_D data by NPI and drug name for future analysis
  • 8. 8 Group Payment data by NPI for future analysis
  • 9. NPI exclusions database used as target 9 Merged exclusion data to the table as “is_fraud” column
  • 10. The data is highly unbalanced. • Only 427 positive samples in total 808076 records (0.05%). • Possible solutions • Down sampling • Up sampling • Give higher weight topositive samples for model training ( used in this project) 10
  • 11. t-test: Select the drug feature only when P <= 0.05 11 No YES Note that this feature selection was only applied on training set to avoid data leakage. NoAn example for “CLOTRIMAZOLE”
  • 12. Five different models used: 12 Logistic regression, Gaussian, Random Forest Classifier, Extra Trees Classifier, Gradient Boosting Classifier
  • 13. Feature importance analysis (from Random Forest) Findings • The “drug_score” feature created become the most important feature • The payment turned out to be trivial or negligible. 13
  • 14. AUC (from Random Forest) 14 • Average AUC ~ 0.685 (> 0.5) • This means the predictive model I built can provide useful information on health care fraud detection.
  • 15. Future work for improvement ● Include more data (e.g., the Part-B dataset) ● Add additional feature (e.g., Page rank) ● Blend multiple model 15
  • 16. References ● https://www.versustexas.com/criminal/healthcare-medicare-medicaid-fraud/ PageRank for Anomaly Detection, By Ofer Mendelevitch and Jiwon Seo ● http://www.dataiku.com/blog/2015/08/12/Medicare_Fraud.html By Pierre Gutierrez @ Dataiku 16