As the nation’s leading advocate for people aged 50+, AARP conducts thousands of campaigns each month, made up of hundreds of millions of emails, mailings, and phone calls to over 37 million members and a broader universe of non-members. Missing demographic information results in less accurate profiling and targeting strategies. For example, 1.5 million active members and 15 million expired members are missing gender information in AARP’s database. The name gender model is a use case in which the AARP Data Analytics team used the Databricks Lakehouse platform to create a fully automated machine learning model. The Random Forest Classifier used 800 thousand existing distinct first names, ages, and over 700 variables derived from the letter composition of first names to predict gender. The team leveraged MLflow to track model accuracy and log metrics over time, and registered multiple model versions to pick the best one for production. Model training and scoring were scheduled, and the AutoML pipeline significantly reduced manual working hours after the initial setup. As a result, AARP dramatically (and accurately) improved the coverage of gender information from about 92.5% to 99.5%.
1. Gender Prediction with Databricks AutoML Pipeline
Sharon Xu
Senior Advisor, Advanced Analytics at AARP
Qing Sun
Resident Solutions Architect, Databricks
2. Sharon Xu, PhD
§ Senior Advanced Analytics Advisor, AARP
§ Focus on Market Targeting with Machine Learning, Incremental and Attribution Analysis, and Digital Behavioral Analytics
§ Graduated from University of Maryland, PhD in Civil Engineering (transportation demand forecasting)
§ Quite busy gardening during the pandemic
Qing Sun
§ Resident Solutions Architect, Databricks
§ Data Scientist and Machine Learning SME for Public Sector Professional Services
§ Thought about climbing Everest
3. Agenda
§ Background
▪ The scope of AARP’s predictive models and the business need for the Gender Model
§ Gender Model Approaches
▪ Data used and derived variables, methodology, and performance
§ AutoML Pipeline for Gender Model
§ Learnings and Future
4. AARP (formerly called the American Association of Retired Persons) has about 38 million members. For decades, it has championed positive social change, delivered value to its 50-plus members, and empowered people to choose how they live as they age.
5. Background - Scope of AARP’s Predictive Models
• Membership Joins or Renews Model
▪ DM
▪ Alt Media
▪ Lead Gen
▪ Renewal Propensity
▪ Transition to Auto Renew
• Channel Preference Model
▪ Live Answer
▪ Email Click
▪ Online (Gmail, Facebook, Display responder, etc.)
• Member Interest Model
▪ Advocacy
▪ Foundation
▪ Event Interest
▪ Online Program Interest
• Member Information Model
▪ Demographics (e.g. diversity indicator)
▪ Imputation (e.g. gender, third party vendor data)
6. Why Gender Model?
• In AARP’s targeting audience universe, gender information is missing for about 2 million individuals.
• The lack of gender information led to less accurate profiling and targeting strategies.
• It is a simple use case for testing the functions of MLflow and Model Registry and for automating the entire ML process.
7. Gender Model Approaches
A Random Forest Classifier identifies gender using existing first names, ages, and many variables derived from the letter composition of first names.
8. Approach
A total of 731 model variables were derived from the first name. They fall into seven categories:
1. Frequency of Letters (26 variables)
2. Sum of Positions of Letters within the name (26 variables)
3. Frequency of Bigrams (26x26 variables)
4. Second-To-Last Character's position in the alphabet [SLC]
5. Last Character's position in the alphabet [LC]
6. Length of Name [LoN]
7. Average Age [AA]
FREQUENCY OF LETTERS
         a  b  c  d  e  f  g  h  i  j  k  l  m  n  •••
Sharon   1  -  -  -  -  -  -  1  -  -  -  -  -  1  •••
Qing     -  -  -  -  -  -  1  -  1  -  -  -  -  1  •••

SUM OF POSITION OF LETTERS
(positions within the name: S=1, h=2, a=3, r=4, o=5, n=6; Q=1, i=2, n=3, g=4)
         a  b  c  d  e  f  g  h  i  j  k  l  m  n  •••
Sharon   3  -  -  -  -  -  -  2  -  -  -  -  -  6  •••
Qing     -  -  -  -  -  -  4  -  2  -  -  -  -  3  •••

FREQUENCY OF BIGRAMS
         aa  ab  ac  •••  aq  ar  •••  ba  bb  •••  im  in  io  •••  zz
Sharon   -   -   -   •••  -   1   •••  -   -   •••  -   -   -   •••  -
Qing     -   -   -   •••  -   -   •••  -   -   •••  -   1   -   •••  -

OTHER VARIABLES
         SLC  LC  LoN  AA
Sharon   15   14  6    38
Qing     14   7   4    35
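The seven feature categories on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the team's code: the function name `name_features`, the column names, and the handling of non-letter characters are all assumptions. Note that this layout produces 732 columns (26 + 26 + 676 + 4), one more than the 731 reported on the slide, so the authors' exact layout likely differs in some detail.

```python
from string import ascii_lowercase

# Alphabet position of each letter: a=1 ... z=26
LETTER_INDEX = {ch: i + 1 for i, ch in enumerate(ascii_lowercase)}
# All 26x26 ordered letter pairs: "aa", "ab", ..., "zz"
BIGRAMS = [a + b for a in ascii_lowercase for b in ascii_lowercase]


def name_features(first_name, average_age):
    """Build a feature dict for one first name (hypothetical column layout)."""
    # Assumption: keep only a-z; the slides do not say how hyphens,
    # spaces, or accented characters were treated.
    name = "".join(ch for ch in first_name.lower() if ch in LETTER_INDEX)

    features = {}
    # 1. Frequency of each letter
    for ch in ascii_lowercase:
        features[f"freq_{ch}"] = name.count(ch)
    # 2. Sum of the 1-based positions at which each letter occurs in the name
    for ch in ascii_lowercase:
        features[f"pos_{ch}"] = sum(
            i for i, c in enumerate(name, start=1) if c == ch
        )
    # 3. Frequency of each bigram (overlapping pairs of adjacent letters)
    for bg in BIGRAMS:
        features[f"bigram_{bg}"] = 0
    for i in range(len(name) - 1):
        features[f"bigram_{name[i:i + 2]}"] += 1
    # 4-7. Alphabet position of the last two letters, name length, average age
    features["slc"] = LETTER_INDEX[name[-2]] if len(name) >= 2 else 0
    features["lc"] = LETTER_INDEX[name[-1]] if name else 0
    features["lon"] = len(name)
    features["aa"] = average_age
    return features


sharon = name_features("Sharon", 38)
print(sharon["pos_n"], sharon["bigram_ar"], sharon["slc"], sharon["lc"])
```

For "Sharon" and "Qing" this reproduces the values shown in the example tables on this slide.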
9. Performance
• Various ML models were tested (Logistic Regression, Random Forest Classifier, Support Vector Machines, Naïve Bayes Classifier, Multilayer Perceptron) and RF was the final winner.
• The model displays 76% accuracy of gender prediction on purely new distinct names (not existing in current model dataset).
• The model displays 90% accuracy of gender prediction on a mixed name list (with and without names in modeling dataset).
• The model can only be used to predict binary gender information due to the lack of non-binary data.
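The two accuracy figures on this slide come from two different evaluation sets: names the model has never seen, and a mixed list. A minimal sketch of how such a split could be computed is below. The function name `accuracy_by_novelty`, the record layout, and the sample names and labels are illustrative assumptions, not the team's evaluation code or data.

```python
def accuracy_by_novelty(records, training_names):
    """Report accuracy separately for new names, seen names, and the mixed list.

    records: iterable of (first_name, true_gender, predicted_gender) tuples.
    training_names: first names present in the modeling dataset.
    """
    known = {name.lower() for name in training_names}
    # Each entry holds [number correct, number evaluated]
    counts = {"new": [0, 0], "seen": [0, 0], "mixed": [0, 0]}
    for name, truth, predicted in records:
        group = "seen" if name.lower() in known else "new"
        # Every record counts toward its own group and toward the mixed list
        for key in (group, "mixed"):
            counts[key][0] += int(truth == predicted)
            counts[key][1] += 1
    # None marks a group with no records, to avoid dividing by zero
    return {
        key: (correct / total if total else None)
        for key, (correct, total) in counts.items()
    }


# Made-up records for illustration only
sample = [
    ("Mary", "F", "F"),
    ("John", "M", "M"),
    ("Avery", "M", "F"),
    ("Rowan", "M", "M"),
]
print(accuracy_by_novelty(sample, training_names=["mary", "john"]))
```

Reporting the "new" group on its own is what makes the gap visible: a mixed list is dominated by names the model has effectively memorized.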
11. AutoML Pipeline Overview
§ Data Lake: pull name/gender/age; data cleansing and aggregation
§ Create 700+ features and save as Delta table
§ Build and train ML pipeline; parameter tuning; pick the best model
§ MLflow Tracking: log tuning metrics and evaluation metrics; register the best model in Model Registry
§ Model deployment: load model from Model Registry; scoring
§ Schedule jobs: train/score
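The first pipeline step (pull name/gender/age, cleanse, aggregate) reduces member records to one row per distinct first name. The sketch below shows one way that could work in plain Python. It is an assumption-heavy illustration: the slides do not say how names appearing with both genders were labeled, so the majority vote used here, the function name `aggregate_names`, and the 'F'/'M' coding are all hypothetical. A production version on Databricks would more likely use Spark DataFrames.

```python
from collections import defaultdict


def aggregate_names(rows):
    """Collapse (first_name, gender, age) rows to one entry per distinct name."""
    stats = defaultdict(lambda: {"F": 0, "M": 0, "age_sum": 0.0, "age_n": 0})
    for first_name, gender, age in rows:
        # Cleansing: normalize case and whitespace, drop unusable rows
        name = (first_name or "").strip().lower()
        if not name or gender not in ("F", "M"):
            continue
        entry = stats[name]
        entry[gender] += 1
        if age is not None:
            entry["age_sum"] += age
            entry["age_n"] += 1

    result = {}
    for name, entry in stats.items():
        # Assumption: label each name by majority vote; ties go to "F" arbitrarily
        label = "F" if entry["F"] >= entry["M"] else "M"
        average_age = (
            entry["age_sum"] / entry["age_n"] if entry["age_n"] else None
        )
        result[name] = {
            "gender": label,
            "average_age": average_age,
            "count": entry["F"] + entry["M"],
        }
    return result


# Made-up rows for illustration only
rows = [
    ("Mary", "F", 70),
    (" mary ", "F", 60),
    ("Mary", "M", 50),
    ("John", "M", 65),
    ("Pat", None, 55),  # missing gender: excluded from training data
]
print(aggregate_names(rows))
```

The per-name average age computed here corresponds to the Average Age [AA] variable in the feature list.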
17. Learnings and Future
§ Fast - Fewer worries about large datasets and tuning iterations with customized cluster and parallelism settings; storing data in Delta format speeds up the pipeline;
§ Model Management - MLflow easily tracks the results of all model runs; Model Registry simplifies model management;
§ Self-training and scoring – Job scheduling completes model retraining and scoring regularly without human intervention.
To ensure the future ML pipeline is automated (as much as possible), we need:
§ An integrated data platform to minimize flat file intake and avoid moving modeling datasets across different platforms;
§ To keep using MLflow and Model Registry to keep model results and storage well organized;
§ To streamline the entire ML pipeline so that the structure is set up only once and retraining/scoring is scheduled regularly with full automation…
Then why do we need modelers? …