Analyzing Breast Cancer Dataset with Azure Machine Learning Studio

This presentation was given by Frank Mendoza of Catalytics, a member of the Chicago Technology for Value-Based Healthcare Meetup (https://www.meetup.com/Chicago-Technology-For-Value-Based-Healthcare-Meetup/), on January 23, 2018.

Published in: Data & Analytics

  1. Analyzing Breast Cancer Dataset with Azure Machine Learning (ML) Studio
     Frank Mendoza, CEO, Catalytics
     Chicago Technology for Value-Based Healthcare Meetup, January 23, 2018
     2018 Catalytics, LLC - Proprietary and Confidential
  2. Breast Cancer Wisconsin (Diagnostic) Dataset Description
     • Total of 569 records in the dataset, donated in 1995
     • 30 distinct numerical attributes (or features) associated with each record
     • No categorical features available within the dataset
     • Location: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
  3. Breast Cancer Wisconsin (Diagnostic) Dataset Description, cont.
     • The column identified as "Diagnosis" is the dataset label
       • M = malignant
       • B = benign
     • [Figure: class counts labeled "300+" and "200+", and an example of measurements]
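The record layout described above (an ID, a Diagnosis label of M or B, then 30 numeric features) can be sketched as a small parser. This is a minimal illustration, not the presenter's code: the function name `parse_records`, the choice of malignant as the positive class, and the two sample rows are all assumptions made for the example, and the sample values are synthetic.

```python
import csv
import io

# Layout assumed from the slides: ID, Diagnosis, then 30 numeric features.
LABELS = {"M": 1, "B": 0}  # assumption: malignant is the positive class


def parse_records(text):
    """Parse comma-separated rows into (record_id, label, features) tuples."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        record_id, diagnosis, *values = row
        features = [float(v) for v in values]
        if len(features) != 30:
            raise ValueError(f"expected 30 features, got {len(features)}")
        records.append((record_id, LABELS[diagnosis], features))
    return records


# Two synthetic rows (made-up values, not real patients) in the same layout.
sample = "\n".join(
    [
        "1001,M," + ",".join(["1.5"] * 30),
        "1002,B," + ",".join(["0.5"] * 30),
    ]
)
records = parse_records(sample)
print([(rid, label, len(feats)) for rid, label, feats in records])
```

Encoding the label as 0/1 up front keeps later steps (splitting, scoring) independent of the original letter codes.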
  4. Core Steps to Build Predictive Models Using Machine Learning
     • [Diagram of the core steps; the only label recovered is step 5(a), Test API]
  5. Acquire Data & Prepare
     • The dataset did not have any missing values
     • Manipulation (normalization, etc.) was still required to ensure the training process would succeed
     • The data was split into two sets to train and test the model
       • Training set = 311 records (~55%)
       • Testing set 1 = 208 records (~37%)
     • An additional testing set was held out to test the model after the API was created, step 5(a)
       • Testing set 2 = 50 records (~9%)
     • The training set and testing set 1 were uploaded to Azure Machine Learning (ML) Studio
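The preparation steps on this slide can be sketched in plain Python. The slides do not say how the records were assigned to each set or which normalization was used, so the seeded random shuffle and the min-max scaling below are assumptions, and the 569 two-feature rows are placeholder data rather than the real dataset.

```python
import random


def split_records(records, n_train, n_test, seed=0):
    """Shuffle once, then cut into training, testing and hold-out sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    holdout = shuffled[n_train + n_test:]
    return train, test, holdout


def fit_min_max(rows):
    """Learn per-column (min, max) from the training rows only."""
    columns = list(zip(*rows))
    return [(min(c), max(c)) for c in columns]


def apply_min_max(rows, ranges):
    """Scale each column with the training ranges (training rows land in [0, 1])."""
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(row, ranges)
        ])
    return scaled


# Placeholder data: 569 rows with 2 made-up features each.
rng = random.Random(1)
data = [[rng.uniform(5, 30), rng.uniform(100, 2500)] for _ in range(569)]

train, test, holdout = split_records(data, n_train=311, n_test=208)
ranges = fit_min_max(train)
train_scaled = apply_min_max(train, ranges)
test_scaled = apply_min_max(test, ranges)

sizes = [len(train), len(test), len(holdout)]
print(sizes)  # [311, 208, 50]
```

Fitting the ranges on the training rows and reusing them for the test rows avoids leaking test-set statistics into training; test values can therefore fall slightly outside [0, 1].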
  6. Training Predictive Model: Choosing Algorithms
     • Since the label has 2 classes (benign vs. malignant), it was clear that a classification model would be necessary
     • Multiple models were developed to identify the best algorithm to use
       • Two-Class Logistic Regression
       • Two-Class Support Vector Machine
       • Two-Class Boosted Decision Tree
       • Two-Class Neural Network (WINNER)
  7. Optimizing Neural Network Model
     • Feature selection: identify which attributes matter
     • [Figure: attributes grouped as "Important" and "Less Important"]
  8. Feature Selection, continued
     • Azure ML contains a module called "Permutation Feature Importance" that tests features to identify their importance
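The general idea behind permutation feature importance can be sketched without any ML library: shuffle one feature column at a time and measure how much the model's accuracy drops. This is an illustration of the technique, not the internals of the Azure ML module; the toy data, the `toy_model` classifier, and the use of accuracy as the metric are all assumptions.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows the model classifies correctly."""
    hits = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return hits / len(rows)


def permutation_importance(model, rows, labels, seed=0):
    """Accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return drops


# Toy setup: the label depends only on feature 0; feature 1 is pure noise.
rng = random.Random(42)
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [1 if row[0] > 0.5 else 0 for row in rows]


def toy_model(row):
    """Stand-in for a trained classifier: thresholds feature 0."""
    return 1 if row[0] > 0.5 else 0


drops = permutation_importance(toy_model, rows, labels)
print(drops)  # a clear drop for feature 0, exactly 0.0 for feature 1
```

A feature whose shuffling leaves the score unchanged is a candidate for removal, which is the kind of evidence a feature-selection pass relies on.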
  9. Cross Validation
     • Azure ML contains a module called "Cross-Validate Model" that evaluates a model by partitioning the data; it is used to check that the model will perform well against unseen/new data
     • 10 folds were used
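The partitioning behind 10-fold cross validation can be sketched as an index generator: each fold serves once as the validation set while the other nine are used for training. The slides do not state how many records were cross-validated, so the count of 519 (311 + 208, the records uploaded to the studio) is an assumption used only for illustration, and `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_records, k=10):
    """Yield (train_idx, validation_idx) pairs for k contiguous folds."""
    base, extra = divmod(n_records, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional record each.
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_records))
        yield train, validation
        start += size


# 519 = 311 + 208; using that count here is an assumption for illustration.
folds = list(k_fold_indices(519, k=10))
fold_sizes = [len(validation) for _, validation in folds]
print(fold_sizes)  # [52, 52, 52, 52, 52, 52, 52, 52, 52, 51]
```

The folds here are contiguous, so the records should be shuffled beforehand if the file is ordered in any meaningful way.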
  10. Neural Network Classification Model Optimized
      • Feature selection allowed us to remove 14 attributes that did not contribute to improving the model
      • Accuracy improved from 0.976 to 0.981
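Accuracy is the share of records classified correctly. The slides do not say which set the reported figures were measured on; if it was the 208-record testing set 1 (an assumption), the two values would correspond to 203 and 204 correct predictions, a difference of one record. The label lists below are hypothetical and exist only to reproduce that arithmetic.

```python
def accuracy(true_labels, predicted_labels):
    """Share of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical 208-record test set; all-ones labels keep the sketch short.
truth = [1] * 208
before = [1] * 203 + [0] * 5   # 203 correct predictions
after = [1] * 204 + [0] * 4    # 204 correct predictions

print(round(accuracy(truth, before), 3))  # 0.976
print(round(accuracy(truth, after), 3))   # 0.981
```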
  11. AZURE ML DEMONSTRATION
  12. AZURE ML API/EXCEL DEMONSTRATION
  13. Frank Mendoza, CEO & Chief Catalyst
      • 900 E. Pecan St, Suite 300-286, Pflugerville, TX 78660-8048
      • Phone: +1 (512) 767-8604
      • Fax: +1 (737) 703-5478
      • Email: Frank@CatalyticsConsulting.com
      • linkedin.com/in/fxmendoza
      • Twitter: @DataDrivenMind
  14. Appendix
  15. Attribute Information
      1) ID number
      2) Diagnosis (M = malignant, B = benign)
      3-32) Ten real-valued features are computed for each cell nucleus:
         a) radius (mean of distances from center to points on the perimeter)
         b) texture (standard deviation of gray-scale values)
         c) perimeter
         d) area
         e) smoothness (local variation in radius lengths)
         f) compactness (perimeter^2 / area - 1.0)
         g) concavity (severity of concave portions of the contour)
         h) concave points (number of concave portions of the contour)
         i) symmetry
         j) fractal dimension ("coastline approximation" - 1)
      Location: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.707&rep=rep1&type=pdf
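The ten base features account for columns 3-32 because, per the UCI dataset description, three statistics (mean, standard error, and "worst") are recorded for each of them, giving 30 features. The sketch below builds a column list on that basis and evaluates the compactness formula from the slide; the column names and suffixes are illustrative labels chosen here, not official identifiers.

```python
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
STATISTICS = ["mean", "se", "worst"]  # illustrative suffixes, not official names


def column_names():
    """ID and diagnosis followed by 3 statistics x 10 base features."""
    names = ["id", "diagnosis"]
    for stat in STATISTICS:
        names.extend(f"{feature}_{stat}" for feature in BASE_FEATURES)
    return names


def compactness(perimeter, area):
    """Compactness as defined on the slide: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0


names = column_names()
print(len(names))             # 32
print(compactness(4.0, 1.0))  # 15.0, for a unit square
```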
