This project aimed to build a predictive model for detecting healthcare fraud using Medicare and payment data from open sources. The data was stored in a Hadoop cluster and preprocessed using Pig scripts. Python was used to implement a data pipeline with scaling and different machine learning models, including logistic regression, Gaussian, random forest, extra trees, and gradient boosting classifiers. The random forest model achieved the best average AUC of 0.69. Feature importance analysis found the custom "drug_score" feature was most important. Future work could include more data sources, additional features, and model blending to improve results.