Democratizing PySpark for Mobile Game Publishing at Zynga
Zynga set out to standardize on PySpark and make it accessible to every analytics team through training, documentation, and mentoring. Analytics teams are now responsible for data products in production, using features such as Pandas UDFs. Democratizing PySpark has resulted in novel applications, including propensity modeling, anomaly detection, and reinforcement learning models in production.
4. Takeaways
▪ Databricks is available for all analytics team members
▪ We experienced growing pains when scaling adoption
▪ Trainings and policies helped our team leverage PySpark
▪ Democratizing PySpark has resulted in novel applications
8. Zynga Analytics
▪ Analytics Engineering
▪ Develops our data platform
▪ Embedded Analytics
▪ Partners with game teams
▪ Central Analytics
▪ Partners with publishing teams
12. Motivation
▪ Level-up our teams
▪ Standardize tooling
▪ Evolve our data platform
▪ Support large-scale analyses
▪ Distribute ownership of data products
13. Training
▪ Onboarding
▪ Training notebooks
▪ Wiki documentation
▪ Mentoring
▪ Offsite events
▪ PySpark office hours
▪ Collaboration
▪ Cross-team projects
▪ Peer review
14. Features
▪ Pandas UDFs
▪ Enables our teams to reuse Python code in a distributed computing environment
▪ Applied to Featuretools, XGBoost, and Keras
▪ Koalas
▪ Provides an intermediate step between Python and Spark dataframes
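The Pandas UDF workflow above boils down to writing ordinary pandas code that Spark then runs per group across the cluster. A minimal sketch, assuming a hypothetical events table with `player_id` and `spend` columns; the pure-pandas function is exactly what PySpark would distribute via `groupBy(...).applyInPandas(...)`:

```python
import pandas as pd

def summarize_player(pdf: pd.DataFrame) -> pd.DataFrame:
    # Pure-pandas logic: collapse one player's event rows into a single
    # feature row. With PySpark available, this same function runs
    # distributed via:
    #   events.groupBy("player_id").applyInPandas(summarize_player, schema=...)
    return pd.DataFrame({
        "player_id": [pdf["player_id"].iloc[0]],
        "total_spend": [pdf["spend"].sum()],
        "sessions": [len(pdf)],
    })

# Local stand-in for the Spark groupBy, using made-up sample events.
events = pd.DataFrame({
    "player_id": ["a", "a", "b"],
    "spend": [1.0, 2.0, 5.0],
})
features = pd.concat(
    [summarize_player(pdf) for _, pdf in events.groupby("player_id")],
    ignore_index=True,
)
```

Because the function body is plain pandas, the same code can be developed and unit-tested locally before being handed to Spark, which is what makes this pattern a reuse path for existing Python code.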
15. DevOps for Data Products
▪ Analytics teams are now responsible for data products in production
▪ Model Pipelines
▪ Batch prediction models
▪ Export results to S3 or Couchbase
▪ Served via experimentation platform
▪ Model Endpoints
▪ Analytics teams responsible for model design and data inputs
▪ Served with AWS SageMaker
17. Zynga Databricks Library
▪ We authored an in-house library for simplifying tasks in Spark
▪ Functionality
▪ Querying Data Stores
▪ Publishing Results
▪ Model Monitoring
▪ Airflow Helper
18. Cluster Management
▪ Issue
▪ Allowing anyone to install or upgrade libraries often resulted in job failures
▪ Resolution
▪ We have development clusters with fixed library versions
▪ All scheduled jobs run on ephemeral clusters
▪ New development clusters are rolled out following major releases
19. Job Ownership
▪ Issue
▪ Having individual owners for jobs often resulted in orphaned data products
▪ Resolution
▪ Jobs now have backup owners and are mapped to teams
▪ We monitor datasets for inactivity to flag jobs for sunset
▪ Robust jobs are migrated to Airflow
▪ We’ve reduced the number of jobs by focusing on portfolio-scale data products
20. Cost Tracking
▪ Issue
▪ We did not have good visibility into which jobs were consuming the most resources
▪ Resolution
▪ We set up tags for tracking team and project utilization
▪ Tagging is automated through our cluster provisioning process
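Automated tagging can be baked into the cluster spec that provisioning generates. A sketch of a Databricks-style spec, where `custom_tags` is the field Databricks propagates to the underlying cloud instances for billing reports; the helper name and tag values are assumptions:

```python
def cluster_spec(team: str, project: str) -> dict:
    # Hypothetical provisioning helper: every cluster the automation
    # creates carries team/project tags, so cloud cost reports can be
    # grouped by owner without anyone tagging clusters by hand.
    return {
        "cluster_name": f"{team}-{project}",
        "autotermination_minutes": 30,  # ephemeral clusters shut themselves down
        "custom_tags": {
            "Team": team,
            "Project": project,
        },
    }

spec = cluster_spec("central-analytics", "automodel")
```

Putting the tags in one helper means a job can only reach a cluster that is already attributable, which is the point of automating this step.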
21. Support
▪ Issue
▪ General questions about how PySpark works or why a notebook task is failing
▪ Job failures that are non-trivial to trace
▪ Resolution
▪ Establish SLAs for responding to questions
▪ Empower more of the team to answer questions
▪ Pair up new hires with experienced users
▪ Use cross-team projects to continue developing PySpark knowledge
23. Propensity Modeling
▪ AutoModel System
▪ Builds hundreds of propensity models daily
▪ Predicts likelihood of users to lapse in activity or make a purchase
▪ Leverages the Featuretools library to automate feature engineering
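A sketch of the per-player aggregation and lapse labeling such a system automates; the column names and the seven-day lapse window are assumptions, and in the real pipeline Featuretools' Deep Feature Synthesis generates many aggregates like these automatically:

```python
import pandas as pd

# Hypothetical session log: one row per player session.
sessions = pd.DataFrame({
    "player_id": ["a", "a", "b"],
    "days_since_session": [1, 3, 10],
    "spend": [0.0, 4.99, 0.0],
})

# Hand-rolled stand-ins for aggregates Featuretools would synthesize.
features = sessions.groupby("player_id").agg(
    total_spend=("spend", "sum"),
    last_seen_days=("days_since_session", "min"),
)

# Label: a player counts as "lapsed" after more than 7 days of
# inactivity (assumed window); a propensity model is then trained on
# features like these to predict the label for still-active players.
features["lapsed"] = features["last_seen_days"] > 7
```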
24. Segmentation
▪ Player Archetypes
▪ We use MLlib to cluster users based on gameplay behavior
▪ Segments can be used for experimentation and marketing
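MLlib's KMeans does this clustering at portfolio scale; the algorithm itself can be sketched with a few Lloyd iterations in NumPy. The toy behavior vectors and the choice of two archetypes are assumptions:

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, iters: int = 20, seed: int = 0) -> np.ndarray:
    # Plain Lloyd's algorithm: the same procedure that
    # pyspark.ml.clustering.KMeans runs distributed over player vectors.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each player vector to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Move each center to the mean of its assigned vectors.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy behavior vectors: [sessions/week, spend/week] for six players,
# forming two obvious archetypes (casual vs. highly engaged).
X = np.array([[1.0, 0.0], [2.0, 0.1], [1.5, 0.0],
              [20.0, 5.0], [22.0, 6.0], [19.0, 4.5]])
labels = kmeans(X, k=2)
```

The resulting cluster labels are what feed experimentation and marketing segments.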
25. Anomaly Detection
▪ Autoencoder for Cheat Detection
▪ Players are represented as 1D images
▪ Players with large vector differences are flagged as suspect
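The flagging step reduces to comparing each player's 1D image to its autoencoder reconstruction. A sketch of just that step with made-up vectors and an assumed threshold; in the real system the reconstructions come from a trained Keras autoencoder:

```python
import numpy as np

def flag_suspects(players: np.ndarray, reconstructed: np.ndarray,
                  threshold: float) -> np.ndarray:
    # Reconstruction error: players the autoencoder cannot reproduce
    # well behave unlike the population it was trained on, so a large
    # vector difference marks them as suspect.
    errors = np.linalg.norm(players - reconstructed, axis=1)
    return errors > threshold

# Hypothetical 1D player images and their model reconstructions.
players = np.array([[0.1, 0.2, 0.1],
                    [0.1, 0.2, 0.2],
                    [0.9, 0.9, 0.9]])   # behaves unlike the others
reconstructed = np.array([[0.1, 0.2, 0.1],
                          [0.1, 0.2, 0.1],
                          [0.2, 0.3, 0.2]])
suspects = flag_suspects(players, reconstructed, threshold=0.5)
```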
26. Economy Simulation
▪ Markov Chains in PySpark
▪ We predict the outcome of game updates using millions of simulated playthroughs
▪ Encoding users as Spark dataframes enables scalable simulations
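The simulation step can be sketched as a Markov chain over player states. The states and transition probabilities below are assumptions for illustration; in the PySpark version each simulated player is a dataframe row, which is what lets the simulation scale to millions of playthroughs:

```python
import numpy as np

# Hypothetical player states for a game-economy simulation.
STATES = ["active", "stuck", "churned"]
# Transition matrix: row i gives P(next state | current state i).
P = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.5, 0.2],
              [0.0, 0.0, 1.0]])  # churned is absorbing

def simulate(n_players: int, steps: int, seed: int = 0) -> np.ndarray:
    # Simulate many playthroughs at once; returns each player's
    # final state index.
    rng = np.random.default_rng(seed)
    state = np.zeros(n_players, dtype=int)  # everyone starts active
    for _ in range(steps):
        # Vectorized transition: one uniform draw per player, mapped
        # through the cumulative distribution of its current row.
        u = rng.random(n_players)
        cdf = P[state].cumsum(axis=1)
        state = np.minimum((u[:, None] > cdf).sum(axis=1), len(STATES) - 1)
    return state

final = simulate(n_players=10_000, steps=20)
churn_rate = np.mean(final == STATES.index("churned"))
```

Re-running the simulation with a candidate game update's transition matrix and comparing outcome distributions is how the predicted impact of the update is estimated.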
28. Reinforcement Learning
▪ Personalization Pipeline
▪ Real-time model serving
▪ In Production
▪ Words with Friends 2
▪ CSR Racing 2
▪ Open Source
▪ RL Bakery
▪ https://github.com/zynga/rl-bakery
30. Takeaways
▪ We are encouraging our entire analytics organization to use PySpark
▪ Driving adoption may require training and hands-on support
▪ It’s important to put policies in place for maintainability and cost
▪ Opening up PySpark to more teams has resulted in useful applications