Gender Prediction with Databricks AutoML Pipeline

Sharon Xu
Senior Advisor, Advanced Analytics at AARP
Qing Sun
Resident Solutions Architect, Databricks

Sharon Xu, PhD
§ Senior Advanced Analytics Advisor, AARP
§ Focus on Market Targeting with Machine Learning, Incremental and Attribution Analysis, and Digital Behavioral Analytics
§ PhD in Civil Engineering (transportation demand forecasting) from the University of Maryland
§ Kept quite busy gardening during the pandemic

Qing Sun
§ Resident Solutions Architect, Databricks
§ Data Scientist and Machine Learning SME for Public Sector Professional Services
§ Thought about climbing Everest

Agenda
§ Background
▪ The scope of AARP’s predictive models and the business need for the Gender Model
§ Gender Model Approaches
▪ Data used and derived variables, methodology, and performance
§ AutoML Pipeline for Gender Model
§ Learnings and Future

AARP (formerly the American Association of Retired Persons) has about 38 million members. For decades it has championed positive social change, delivered value to its 50-plus members, and empowered people to choose how they live as they age.

Background - Scope of AARP’s Predictive Models
• Membership Joins or Renews Model
▪ DM
▪ Alt Media
▪ Lead Gen
▪ Renewal Propensity
▪ Transition to Auto Renew
• Channel Preference Model
▪ Live Answer
▪ Email Click
▪ Online (Gmail, Facebook, Display responder, etc.)
• Member Interest Model
▪ Advocacy
▪ Foundation
▪ Event Interest
▪ Online Program Interest
• Member Information Model
▪ Demographics (e.g. diversity indicator)
▪ Imputation (e.g. gender, third party vendor data)

Why Gender Model?
• In AARP’s targeting audience universe, about 2 million records are missing gender information.
• Without gender information, profiling and targeting strategies were less accurate.
• It is a simple use case for testing the functions of MLflow and Model Registry and for automating the entire ML process.

Gender Model Approaches
A Random Forest Classifier identifies gender using existing first names, ages, and many variables derived from the alphabetical order of the letters in first names.

Approach
A total of 731 model variables were derived from First Name. These variables fall into seven categories:
1. Frequency of Letters (26 variables)
2. Sum of Position of Letters (26 variables)
3. Frequency of Bigrams (26x26 variables)
4. Second-To-Last Character Position [SLC]
5. Last Character Position [LC]
6. Length of Name [LoN]
7. Average Age [AA]
FREQUENCY OF LETTERS
Name    a  b  c  d  e  f  g  h  i  j  k  l  m  n  •••
Sharon  1  -  -  -  -  -  -  1  -  -  -  -  -  1  •••
Qing    -  -  -  -  -  -  1  -  1  -  -  -  -  1  •••

SUM OF POSITION OF LETTERS
(position within the name: S=1, h=2, a=3, r=4, o=5, n=6; Q=1, i=2, n=3, g=4)
Name    a  b  c  d  e  f  g  h  i  j  k  l  m  n  •••
Sharon  3  -  -  -  -  -  -  2  -  -  -  -  -  6  •••
Qing    -  -  -  -  -  -  4  -  2  -  -  -  -  3  •••

FREQUENCY OF BIGRAMS
Name    aa  ab  ac  •••  aq  ar  •••  ba  bb  •••  im  in  io  •••  zz
Sharon  -   -   -   •••  -   1   •••  -   -   •••  -   -   -   •••  -
Qing    -   -   -   •••  -   -   •••  -   -   •••  -   1   -   •••  -

OTHER VARIABLES
(SLC and LC are alphabet positions: o=15, n=14 for Sharon; n=14, g=7 for Qing)
Name    SLC  LC  LoN  AA
Sharon  15   14  6    38
Qing    14   7   4    35
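
The seven variable categories above can be sketched in plain Python. This is only an illustration of the idea, not the notebook shown in the session: the function name `name_features`, the column prefixes (`freq_`, `pos_`, `bi_`), and the choice to drop non-letter characters are all assumptions made for this example.

```python
from itertools import product
from string import ascii_lowercase

LETTERS = ascii_lowercase
# All 26 x 26 = 676 possible two-letter combinations, "aa" through "zz".
BIGRAMS = ["".join(pair) for pair in product(LETTERS, repeat=2)]


def name_features(first_name, average_age):
    """Build name-derived variables for one first name (hypothetical helper)."""
    # Assumption: lower-case the name and keep only the letters a-z.
    name = "".join(ch for ch in first_name.lower() if ch in LETTERS)
    if not name:
        raise ValueError("first name contains no a-z letters")

    features = {}
    for letter in LETTERS:
        # 1. Frequency of Letters
        features[f"freq_{letter}"] = name.count(letter)
        # 2. Sum of Position of Letters (1-based position within the name)
        features[f"pos_{letter}"] = sum(
            i for i, ch in enumerate(name, start=1) if ch == letter
        )

    # 3. Frequency of Bigrams (adjacent letter pairs)
    for bigram in BIGRAMS:
        features[f"bi_{bigram}"] = 0
    for i in range(len(name) - 1):
        features[f"bi_{name[i:i + 2]}"] += 1

    # 4-5. Alphabet position (a=1 ... z=26) of the second-to-last and last
    # characters; 0 for SLC when the name has a single letter (an assumption).
    features["slc"] = LETTERS.index(name[-2]) + 1 if len(name) > 1 else 0
    features["lc"] = LETTERS.index(name[-1]) + 1
    # 6. Length of Name
    features["lon"] = len(name)
    # 7. Average Age, passed in because it comes from the member data
    features["aa"] = average_age
    return features
```

Calling `name_features("Sharon", 38)` reproduces the slide's example row: `pos_a` is 3, `pos_h` is 2, `pos_n` is 6, `bi_ar` is 1, and SLC, LC, LoN are 15, 14, 6. Note that this sketch yields 26 + 26 + 676 + 4 = 732 columns, one more than the 731 quoted on the slide; the deck does not say which variable is left out.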

Performance
• Various ML models were tested, and Random Forest (RF) performed best.
• The model achieves 76% accuracy on entirely new distinct names (names not present in the current modeling dataset).
• The model achieves 90% accuracy on a mixed name list (names both in and not in the modeling dataset).
• The model can only predict binary gender because non-binary data is lacking.
Models compared: Logistic Regression, Random Forest Classifier, Support Vector Machines, Naïve Bayes Classifier, Multilayer Perceptron
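
Measuring accuracy on entirely new names implies holding out whole names rather than individual records, so that no test name is seen during training. The deck does not show how the evaluation sets were built; the following is a minimal sketch of one way to do it, assuming records are dicts with a hypothetical `first_name` key.

```python
import random


def split_by_distinct_name(records, test_fraction=0.2, seed=42):
    """Hold out whole names so that no test name appears in training."""
    # Compare names case-insensitively so "Alex" and "alex" stay together.
    names = sorted({r["first_name"].lower() for r in records})
    rng = random.Random(seed)
    rng.shuffle(names)
    n_test = max(1, round(len(names) * test_fraction))
    test_names = set(names[:n_test])
    train = [r for r in records if r["first_name"].lower() not in test_names]
    test = [r for r in records if r["first_name"].lower() in test_names]
    return train, test


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

A mixed-list evaluation would instead sample records at random, so some test names also occur in training; that is consistent with the higher accuracy reported for the mixed name list.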

AutoML Pipeline for Gender Model

AutoML Pipeline Overview
1. Data lake: pull name/gender/age, data cleansing and aggregation
2. Create 700+ features and save as Delta table
3. Build and train ML pipeline, parameter tuning, pick the best model
4. MLflow Tracking: log tuning matrix and evaluation matrix, register the best model in Model Registry
5. Model deployment: load model from Model Registry, scoring
6. Schedule jobs: Train/Score
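
The first pipeline step (pull name/gender/age, cleanse, aggregate) can be sketched as follows. On Databricks this would more likely be a Spark aggregation; plain Python is used here only to keep the example self-contained. The cleansing rules, the `"F"`/`"M"` gender codes, and the output field names are assumptions for illustration, not details from the session.

```python
from collections import defaultdict


def clean_name(raw):
    """Lower-case a raw first name and keep only a-z; None if nothing is left."""
    if raw is None:
        return None
    name = "".join(ch for ch in raw.strip().lower() if "a" <= ch <= "z")
    return name or None


def aggregate_by_name(rows):
    """Aggregate (first_name, gender, age) rows to one entry per cleaned name."""
    ages = defaultdict(list)
    gender_counts = defaultdict(lambda: {"F": 0, "M": 0})
    for raw_name, gender, age in rows:
        name = clean_name(raw_name)
        # Assumption: rows with a missing name, gender, or age are dropped
        # from the training data (they are the ones to be scored later).
        if name is None or gender not in ("F", "M") or age is None:
            continue
        ages[name].append(age)
        gender_counts[name][gender] += 1

    return {
        name: {
            "average_age": sum(values) / len(values),
            "female_count": gender_counts[name]["F"],
            "male_count": gender_counts[name]["M"],
        }
        for name, values in ages.items()
    }
```

The per-name `average_age` here corresponds to the Average Age [AA] variable, and the gender counts give the label information a classifier would be trained on.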

Selective code demo
• Feature Engineering (Notebook)

Learnings and Future
§ Fast - Fewer worries about large datasets and tuning iterations, thanks to customized cluster and parallelism settings; storing data in Delta format speeds up the pipeline.
§ Model Management - MLflow easily tracks the results of all model runs; Model Registry simplifies model management.
§ Self-training and scoring - Scheduled jobs complete model retraining and scoring regularly without human intervention.
To keep the future ML pipeline automated (as much as possible), we need:
§ An integrated data platform, to minimize flat file intake and avoid moving modeling datasets across different platforms;
§ To keep using MLflow and Model Registry, so that model results and storage stay well organized;
§ To streamline the entire ML pipeline, so that the structure is set up only once and retraining/scoring is scheduled regularly with full automation…
Then why do we need modelers? …

