Gender Prediction with
Databricks AutoML Pipeline
Sharon Xu
Senior Advisor, Advanced Analytics at AARP
Qing Sun
Resident Solutions Architect, Databricks

As the nation’s leading advocate for people aged 50+, AARP conducts thousands of campaigns each month, made up of hundreds of millions of emails, mailings, and phone calls to over 37 million members and a broader universe of non-members. Missing demographic information results in less accurate profiling and targeting strategies. For example, 1.5 million active members and 15 million expired members are missing gender information in AARP’s database. The name gender model is a use case in which the AARP Data Analytics team used the Databricks Lakehouse platform to create a fully automated machine learning model. The Random Forest Classifier used 800 thousand existing distinct first names, ages, and over 700 variables derived from the letter composition of first names to predict gender. The team used MLflow to track model accuracy and log metrics over time, and registered multiple model versions to pick the best one for production. Model training and scoring were scheduled, and the AutoML pipeline significantly reduced manual working hours after the initial setup. As a result, AARP dramatically (and accurately) improved the coverage of gender information from about 92.5% to 99.5%.
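
The tune, track, and pick-the-best loop described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy names are invented, the suffix-majority model is a simple stand-in for the Random Forest Classifier, and an in-memory list stands in for the MLflow tracking store and Model Registry. No MLflow API is used.

```python
from collections import Counter, defaultdict

# Toy labelled first names, invented for illustration only.
TRAIN = [("maria", "F"), ("anna", "F"), ("linda", "F"), ("susan", "F"), ("karen", "F"),
         ("john", "M"), ("brian", "M"), ("kevin", "M"), ("jason", "M"), ("mark", "M")]
VALID = [("laura", "F"), ("helen", "F"), ("ryan", "M"), ("dean", "M")]


def train_suffix_model(rows, suffix_len):
    """Stand-in model: predict the majority gender seen for a name's suffix."""
    votes = defaultdict(Counter)
    for name, gender in rows:
        votes[name[-suffix_len:]][gender] += 1
    # Fallback for suffixes never seen in training: overall majority label.
    default = Counter(gender for _, gender in rows).most_common(1)[0][0]
    table = {suffix: counts.most_common(1)[0][0] for suffix, counts in votes.items()}
    return lambda name: table.get(name[-suffix_len:], default)


def accuracy(model, rows):
    """Fraction of rows whose predicted gender matches the label."""
    return sum(model(name) == gender for name, gender in rows) / len(rows)


# Hypothetical tracking store: one record per tuning run (parameters, metric, model).
runs = []
for suffix_len in (1, 2, 3):
    model = train_suffix_model(TRAIN, suffix_len)
    runs.append({
        "params": {"suffix_len": suffix_len},
        "metrics": {"accuracy": accuracy(model, VALID)},
        "model": model,
    })

# "Pick the best for production": the run with the highest validation accuracy.
best = max(runs, key=lambda run: run["metrics"]["accuracy"])
print(best["params"], best["metrics"])
```

In a real pipeline each record in `runs` would be a tracked run and `best` would be the model version promoted for scheduled scoring. The sketch only shows the selection logic.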


  1. Gender Prediction with Databricks AutoML Pipeline
     Sharon Xu, Senior Advisor, Advanced Analytics at AARP
     Qing Sun, Resident Solutions Architect, Databricks
  2. Sharon Xu, PhD
     § Senior Advanced Analytics Advisor, AARP
     § Focus on Market Targeting with Machine Learning, Incremental and Attribution Analysis, and Digital Behavioral Analytics
     § Graduated from University of Maryland, PhD in Civil Engineering (transportation demand forecasting)
     § Quite busy gardening during the pandemic
     Qing Sun
     § Resident Solutions Architect, Databricks
     § Data Scientist and Machine Learning SME for Public Sector Professional Services
     § Thought about climbing Everest
  3. Agenda
     § Background
       ▪ The scope of AARP’s predictive models and the business need for the Gender Model
     § Gender Model Approaches
       ▪ Data used and derived variables, methodology, and performance
     § AutoML Pipeline for Gender Model
     § Learnings and Future
  4. AARP (formerly called the American Association of Retired Persons) has about 38 million members. For decades it has championed positive social change, delivered value to its 50-plus members, and empowered people to choose how they live as they age.
  5. Background - Scope of AARP’s Predictive Models
     ▪ DM
     ▪ Alt Media
     ▪ Lead Gen
     ▪ Renewal Propensity
     ▪ Transition to Auto Renew
     ▪ Live Answer
     ▪ Email Click
     ▪ Online (Gmail, Facebook, Display responder, etc.)
     • Channel Preference Model
     • Membership Joins or Renews Model
     ▪ Advocacy
     ▪ Foundation
     ▪ Event Interest
     ▪ Online Program Interest
     • Member Interest Model
     ▪ Demographics (e.g. diversity indicator)
     ▪ Imputation (e.g. gender, third party vendor data)
     • Member Information Model
  6. Why Gender Model?
     • In AARP’s targeting audience universe, about 2 million records are missing gender information.
     • The lack of gender information led to less accurate profiling and targeting strategies.
     • It is a simple use case for testing the functions of MLflow and Model Registry and for automating the entire ML process.
  7. Gender Model Approaches
     Random Forest Classifier to identify gender using the existing first names, ages, and many variables derived from the letter composition of first names.
  8. Approach
     A total of 731 model variables were derived using First Name. These 731 variables can be divided into seven different categories:
       1. Frequency of Letters (26 variables)
       2. Sum of Position of Letters (26 variables)
       3. Frequency of Bigrams (26x26 variables)
       4. Second-To-Last Character Position [SLC]
       5. Last Character Position [LC]
       6. Length of Name [LoN]
       7. Average Age [AA]

     Letter positions used in the examples: S=1, h=2, a=3, r=4, o=5, n=6; Q=1, i=2, n=3, g=4.

     FREQUENCY OF LETTERS
               a  b  c  d  e  f  g  h  i  j  k  l  m  n  •••
       Sharon  1  -  -  -  -  -  -  1  -  -  -  -  -  1
       Qing    -  -  -  -  -  -  1  -  1  -  -  -  -  1

     SUM OF POSITION OF LETTERS
               a  b  c  d  e  f  g  h  i  j  k  l  m  n  •••
       Sharon  3  -  -  -  -  -  -  2  -  -  -  -  -  6
       Qing    -  -  -  -  -  -  4  -  2  -  -  -  -  3

     FREQUENCY OF BIGRAMS
               aa  ab  ac  •••  aq  ar  •••  ba  bb  •••  im  in  io  •••  zz
       Sharon  -   -   -        -   1        -   -        -   -   -        -
       Qing    -   -   -        -   -        -   -        -   1   -        -

     OTHER VARIABLES
               SLC  LC  LoN  AA
       Sharon  15   14  6    38
       Qing    14   7   4    35
  9. Performance
     Models compared: Logistic Regression, Random Forest Classifier, Support Vector Machines, Naïve Bayes Classifier, Multilayer Perceptron
     • Various ML models were tested and RF was the final winner.
     • The model achieves 76% accuracy in gender prediction on purely new distinct names (names not in the current modeling dataset).
     • The model achieves 90% accuracy in gender prediction on a mixed name list (names both in and not in the modeling dataset).
     • The model can only be used to predict binary gender information due to the lack of non-binary data.
  10. AutoML Pipeline for Gender Model
  11. AutoML Pipeline Overview
      ▪ DATA LAKE
      ▪ Pull name/gender/age, data cleansing and aggregation
      ▪ Create 700+ features and save as Delta table
      ▪ Build and train ML pipeline, parameter tuning, pick the best model
      ▪ MLflow Tracking: log tuning metrics, evaluation metrics, and the best model
      ▪ Model Registry
      ▪ Model deployment: load model from Model Registry, scoring
      ▪ Schedule jobs: Train/Score
  12. Selective code demo
      • Feature Engineering (Notebook)
  13. • Modeling
  14. • MLflow and Model Registry
  15. • Job Schedule (scoring)
  16. Learnings and Future
      § Fast - Fewer worries about large datasets and tuning iterations, thanks to customized cluster and parallelism settings; storing data in Delta format speeds up the pipeline.
      § Model Management - MLflow easily tracks the results of all model runs; Model Registry simplifies model management.
      § Self-training and scoring - Job Schedule completes model retraining and scoring regularly without human intervention.
      To ensure the future ML pipeline is automated (as much as possible), we need:
      § An integrated data platform, to minimize flat file intake or avoid moving modeling datasets across different platforms;
      § To keep using MLflow and Model Registry so that model results and storage stay well organized;
      § To streamline the entire ML pipeline so that the structure is set up only once and retraining/scoring is scheduled regularly with full automation…
      Then why do we need modelers? …
  17. Feedback
      Your feedback is important to us. Don’t forget to rate and review the sessions.
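
The seven feature categories listed on slide 8 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the notebook shown in the talk: the function name `name_features`, the feature key names, and the lowercase a-z cleaning step are all hypothetical, and Average Age is simply passed in because the slides do not say how it is aggregated. Note that the listed counts (26 + 26 + 26x26 + 4) add up to 732, while the slide quotes 731; the sketch does not try to guess which variable the original omits.

```python
from string import ascii_lowercase

LETTERS = ascii_lowercase
BIGRAMS = [a + b for a in LETTERS for b in LETTERS]  # 26 x 26 = 676 letter pairs


def alphabet_index(char):
    """1-based position of a letter in the alphabet (a=1 ... z=26)."""
    return LETTERS.index(char) + 1


def name_features(first_name, average_age):
    """Build the name-derived features as a dict keyed by (hypothetical) feature name."""
    # Assumption: keep only the letters a-z, lowercased.
    name = [c for c in first_name.lower() if c in LETTERS]
    features = {}

    # 1. Frequency of each letter; 2. sum of the 1-based positions where it occurs.
    for letter in LETTERS:
        positions = [i for i, c in enumerate(name, start=1) if c == letter]
        features["freq_" + letter] = len(positions)
        features["pos_" + letter] = sum(positions)

    # 3. Frequency of each pair of adjacent letters.
    for bigram in BIGRAMS:
        features["bigram_" + bigram] = 0
    for left, right in zip(name, name[1:]):
        features["bigram_" + left + right] += 1

    # 4-5. Alphabet index of the second-to-last and last characters (0 if absent).
    features["SLC"] = alphabet_index(name[-2]) if len(name) >= 2 else 0
    features["LC"] = alphabet_index(name[-1]) if name else 0

    # 6-7. Length of the name and the average age supplied by the caller.
    features["LoN"] = len(name)
    features["AA"] = average_age
    return features


sharon = name_features("Sharon", 38)
qing = name_features("Qing", 35)
print(sharon["SLC"], sharon["LC"], sharon["LoN"], sharon["AA"])  # 15 14 6 38
print(qing["SLC"], qing["LC"], qing["LoN"], qing["AA"])          # 14 7 4 35
```

The two example calls reproduce the "Sharon" and "Qing" rows of the slide 8 tables, for instance `sharon["pos_n"]` is 6 and `qing["bigram_in"]` is 1.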

