Democratizing PySpark for Mobile Game Publishing at Zynga
Zynga set out to standardize on PySpark and make it accessible to every analytics team through training, documentation, and mentoring. Analytics teams are now responsible for data products in production, using features such as Pandas UDFs. Democratizing PySpark has resulted in novel applications, including propensity modeling, anomaly detection, and reinforcement learning models in production.
4. Takeaways
▪ Databricks is available for all analytics team members
▪ We experienced growing pains when scaling adoption
▪ Trainings and policies helped our team leverage PySpark
▪ Democratizing PySpark has resulted in novel applications
8. Zynga Analytics
▪ Analytics Engineering
▪ Develops our data platform
▪ Embedded Analytics
▪ Partners with game teams
▪ Central Analytics
▪ Partners with publishing teams
12. Motivation
▪ Level-up our teams
▪ Standardize tooling
▪ Evolve our data platform
▪ Support large-scale analyses
▪ Distribute ownership of data products
13. Training
▪ Onboarding
▪ Training notebooks
▪ Wiki documentation
▪ Mentoring
▪ Offsite events
▪ PySpark office hours
▪ Collaboration
▪ Cross-team projects
▪ Peer review
14. Features
▪ Pandas UDFs
▪ Enables our teams to reuse Python code in a distributed computing environment
▪ Applied to Featuretools, XGBoost, and Keras
▪ Koalas
▪ Provides an intermediate step between Python and Spark dataframes
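The Pandas UDF workflow above boils down to writing ordinary pandas code that Spark then runs per group across the cluster. A minimal sketch, assuming a hypothetical events table with `player_id` and `spend` columns; the pure-pandas function is exactly what PySpark would distribute via `groupBy(...).applyInPandas(...)`:

```python
import pandas as pd

def summarize_player(pdf: pd.DataFrame) -> pd.DataFrame:
    # Pure-pandas logic: collapse one player's event rows into a single
    # feature row. With PySpark available, this same function runs
    # distributed via:
    #   events.groupBy("player_id").applyInPandas(summarize_player, schema=...)
    return pd.DataFrame({
        "player_id": [pdf["player_id"].iloc[0]],
        "total_spend": [pdf["spend"].sum()],
        "sessions": [len(pdf)],
    })

# Local stand-in for the Spark groupBy, using made-up sample events.
events = pd.DataFrame({
    "player_id": ["a", "a", "b"],
    "spend": [1.0, 2.0, 5.0],
})
features = pd.concat(
    [summarize_player(pdf) for _, pdf in events.groupby("player_id")],
    ignore_index=True,
)
```

Because the function body is plain pandas, the same code can be developed and unit-tested locally before being handed to Spark, which is what makes this pattern a reuse path for existing Python code.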
15. DevOps for Data Products
▪ Analytics teams are now responsible for data products in production
▪ Model Pipelines
▪ Batch prediction models
▪ Export results to S3 or Couchbase
▪ Served via experimentation platform
▪ Model Endpoints
▪ Analytics teams responsible for model design and data inputs
▪ Served with AWS SageMaker
17. Zynga Databricks Library
▪ We authored an in-house library for simplifying tasks in Spark
▪ Functionality
▪ Querying Data Stores
▪ Publishing Results
▪ Model Monitoring
▪ Airflow Helper
18. Cluster Management
▪ Issue
▪ Allowing anyone to install or upgrade libraries often resulted in job failures
▪ Resolution
▪ We have development clusters with fixed library versions
▪ All scheduled jobs run on ephemeral clusters
▪ New development clusters are rolled out following major releases
19. Job Ownership
▪ Issue
▪ Having individual owners for jobs often resulted in orphaned data products
▪ Resolution
▪ Jobs now have backup owners and are mapped to teams
▪ We monitor datasets for inactivity to flag jobs for sunset
▪ Robust jobs are migrated to Airflow
▪ We’ve reduced the number of jobs by focusing on portfolio-scale data products
20. Cost Tracking
▪ Issue
▪ We did not have good visibility into which jobs were consuming the most resources
▪ Resolution
▪ We set up tags for tracking team and project utilization
▪ Tagging is automated through our cluster provisioning process
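Automated tagging can be baked into the cluster spec that provisioning generates. A sketch of a Databricks-style spec, where `custom_tags` is the field Databricks propagates to the underlying cloud instances for billing reports; the helper name and tag values are assumptions:

```python
def cluster_spec(team: str, project: str) -> dict:
    # Hypothetical provisioning helper: every cluster the automation
    # creates carries team/project tags, so cloud cost reports can be
    # grouped by owner without anyone tagging clusters by hand.
    return {
        "cluster_name": f"{team}-{project}",
        "autotermination_minutes": 30,  # ephemeral clusters shut themselves down
        "custom_tags": {
            "Team": team,
            "Project": project,
        },
    }

spec = cluster_spec("central-analytics", "automodel")
```

Putting the tags in one helper means a job can only reach a cluster that is already attributable, which is the point of automating this step.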
21. Support
▪ Issue
▪ General questions about how PySpark works or why a notebook task is failing
▪ Job failures that are non-trivial to trace
▪ Resolution
▪ Establish SLAs for responding to questions
▪ Empower more of the team to answer questions
▪ Pair up new hires with experienced users
▪ Use cross-team projects to continue developing PySpark knowledge
23. Propensity Modeling
▪ AutoModel System
▪ Builds hundreds of propensity models daily
▪ Predicts likelihood of users to lapse in activity or make a purchase
▪ Leverages the Featuretools library to automate feature engineering
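A sketch of the per-player aggregation and lapse labeling such a system automates; the column names and the seven-day lapse window are assumptions, and in the real pipeline Featuretools' Deep Feature Synthesis generates many aggregates like these automatically:

```python
import pandas as pd

# Hypothetical session log: one row per player session.
sessions = pd.DataFrame({
    "player_id": ["a", "a", "b"],
    "days_since_session": [1, 3, 10],
    "spend": [0.0, 4.99, 0.0],
})

# Hand-rolled stand-ins for aggregates Featuretools would synthesize.
features = sessions.groupby("player_id").agg(
    total_spend=("spend", "sum"),
    last_seen_days=("days_since_session", "min"),
)

# Label: a player counts as "lapsed" after more than 7 days of
# inactivity (assumed window); a propensity model is then trained on
# features like these to predict the label for still-active players.
features["lapsed"] = features["last_seen_days"] > 7
```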
24. Segmentation
▪ Player Archetypes
▪ We use MLlib to cluster users based on gameplay behavior
▪ Segments can be used for experimentation and marketing
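MLlib's KMeans does this clustering at portfolio scale; the algorithm itself can be sketched with a few Lloyd iterations in NumPy. The toy behavior vectors and the choice of two archetypes are assumptions:

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, iters: int = 20, seed: int = 0) -> np.ndarray:
    # Plain Lloyd's algorithm: the same procedure that
    # pyspark.ml.clustering.KMeans runs distributed over player vectors.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each player vector to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Move each center to the mean of its assigned vectors.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy behavior vectors: [sessions/week, spend/week] for six players,
# forming two obvious archetypes (casual vs. highly engaged).
X = np.array([[1.0, 0.0], [2.0, 0.1], [1.5, 0.0],
              [20.0, 5.0], [22.0, 6.0], [19.0, 4.5]])
labels = kmeans(X, k=2)
```

The resulting cluster labels are what feed experimentation and marketing segments.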
25. Anomaly Detection
▪ Autoencoder for Cheat Detection
▪ Players are represented as 1D images
▪ Players with large vector differences are flagged as suspect
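The flagging step reduces to comparing each player's 1D image to its autoencoder reconstruction. A sketch of just that step with made-up vectors and an assumed threshold; in the real system the reconstructions come from a trained Keras autoencoder:

```python
import numpy as np

def flag_suspects(players: np.ndarray, reconstructed: np.ndarray,
                  threshold: float) -> np.ndarray:
    # Reconstruction error: players the autoencoder cannot reproduce
    # well behave unlike the population it was trained on, so a large
    # vector difference marks them as suspect.
    errors = np.linalg.norm(players - reconstructed, axis=1)
    return errors > threshold

# Hypothetical 1D player images and their model reconstructions.
players = np.array([[0.1, 0.2, 0.1],
                    [0.1, 0.2, 0.2],
                    [0.9, 0.9, 0.9]])   # behaves unlike the others
reconstructed = np.array([[0.1, 0.2, 0.1],
                          [0.1, 0.2, 0.1],
                          [0.2, 0.3, 0.2]])
suspects = flag_suspects(players, reconstructed, threshold=0.5)
```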
26. Economy Simulation
▪ Markov Chains in PySpark
▪ We predict the outcome of game updates using millions of simulated playthroughs
▪ Encoding users as Spark dataframes enables scalable simulations
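The simulation step can be sketched as a Markov chain over player states. The states and transition probabilities below are assumptions for illustration; in the PySpark version each simulated player is a dataframe row, which is what lets the simulation scale to millions of playthroughs:

```python
import numpy as np

# Hypothetical player states for a game-economy simulation.
STATES = ["active", "stuck", "churned"]
# Transition matrix: row i gives P(next state | current state i).
P = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.5, 0.2],
              [0.0, 0.0, 1.0]])  # churned is absorbing

def simulate(n_players: int, steps: int, seed: int = 0) -> np.ndarray:
    # Simulate many playthroughs at once; returns each player's
    # final state index.
    rng = np.random.default_rng(seed)
    state = np.zeros(n_players, dtype=int)  # everyone starts active
    for _ in range(steps):
        # Vectorized transition: one uniform draw per player, mapped
        # through the cumulative distribution of its current row.
        u = rng.random(n_players)
        cdf = P[state].cumsum(axis=1)
        state = np.minimum((u[:, None] > cdf).sum(axis=1), len(STATES) - 1)
    return state

final = simulate(n_players=10_000, steps=20)
churn_rate = np.mean(final == STATES.index("churned"))
```

Re-running the simulation with a candidate game update's transition matrix and comparing outcome distributions is how the predicted impact of the update is estimated.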
28. Reinforcement Learning
▪ Personalization Pipeline
▪ Real-time model serving
▪ In Production
▪ Words with Friends 2
▪ CSR Racing 2
▪ Open Source
▪ RL Bakery
▪ https://github.com/zynga/rl-bakery
30. Takeaways
▪ We are encouraging our entire analytics organization to use PySpark
▪ Driving adoption may require training and hands-on support
▪ It’s important to put policies in place for maintainability and cost
▪ Opening up PySpark to more teams has resulted in useful applications