
Democratizing PySpark for Mobile Game Publishing

At Zynga we’ve opened up our PySpark environment to our full analytics organization, which includes game analytics, data science, and engineering teams across the globe.

Published in: Data & Analytics

  1. Democratizing PySpark for Mobile Game Publishing. Ben Weber, Distinguished Data Scientist @ Zynga
  2. Connecting the world through games
  3. Takeaways ▪ Databricks is available for all analytics team members ▪ We experienced growing pains when scaling adoption ▪ Training and policies helped our team leverage PySpark ▪ Democratizing PySpark has resulted in novel applications
  4. Agenda ▪ Mobile Game Publishing ▪ Democratizing PySpark ▪ Learnings ▪ Applications
  5. Mobile Game Publishing
  6. Zynga Games
  7. Zynga Analytics ▪ Analytics Engineering ▪ Develops our data platform ▪ Embedded Analytics ▪ Partners with game teams ▪ Central Analytics ▪ Partners with publishing teams
  8. Zynga’s Publishing Platform ▪ Analytics & Reporting ▪ Experimentation Platform ▪ Personalization Services ▪ Marketing Optimization
  9. Zynga’s Analytics Evolution ▪ SQL Era (2007-2017) ▪ Notebook Era (2017-2019) ▪ Production Era (2019-Present)
  10. Democratizing PySpark
  11. Motivation ▪ Level-up our teams ▪ Standardize tooling ▪ Evolve our data platform ▪ Support large-scale analyses ▪ Distribute ownership of data products
  12. Training ▪ Onboarding ▪ Training notebooks ▪ Wiki documentation ▪ Mentoring ▪ Offsite events ▪ PySpark office hours ▪ Collaboration ▪ Cross-team projects ▪ Peer review
  13. Features ▪ Pandas UDFs ▪ Enables our teams to reuse Python code in a distributed computing environment ▪ Applied to Featuretools, XGBoost, and Keras ▪ Koalas ▪ Provides an intermediate step between Python and Spark dataframes
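
The appeal of Pandas UDFs described on slide 13 is that the same function runs both locally and on a cluster. A minimal sketch, assuming a hypothetical scoring function and column names; the Spark registration is shown in comments because it requires a running cluster:

```python
import pandas as pd

def purchase_propensity(spend: pd.Series, sessions: pd.Series) -> pd.Series:
    """Toy per-player score combining spend and session percentile ranks.

    Written against plain pandas Series, so it can be unit tested
    locally and reused unchanged in a distributed setting.
    """
    return (spend.rank(pct=True) + sessions.rank(pct=True)) / 2.0

# On a Spark cluster the identical function is registered as a
# vectorized (Pandas) UDF:
#
# from pyspark.sql.functions import pandas_udf
# score_udf = pandas_udf(purchase_propensity, "double")
# players_df = players_df.withColumn("score", score_udf("spend", "sessions"))

scores = purchase_propensity(pd.Series([0.0, 5.0, 50.0]), pd.Series([1, 10, 3]))
```

Keeping the core logic in a plain pandas function is what lets analysts develop and test on a laptop before scaling out.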
  14. DevOps for Data Products ▪ Analytics teams are now responsible for data products in production ▪ Model Pipelines ▪ Batch prediction models ▪ Export results to S3 or Couchbase ▪ Served via experimentation platform ▪ Model Endpoints ▪ Analytics teams responsible for model design and data inputs ▪ Served with AWS SageMaker
  15. Learnings
  16. Zynga Databricks Library ▪ We authored an in-house library for simplifying tasks in Spark ▪ Functionality ▪ Querying Data Stores ▪ Publishing Results ▪ Model Monitoring ▪ Airflow Helper
  17. Cluster Management ▪ Issue ▪ Allowing anyone to install or upgrade libraries often resulted in job failures ▪ Resolution ▪ We have development clusters with fixed library versions ▪ All scheduled jobs run on ephemeral clusters ▪ New development clusters are rolled out following major releases
  18. Job Ownership ▪ Issue ▪ Having individual owners for jobs often resulted in orphaned data products ▪ Resolution ▪ Jobs now have backup owners and are mapped to teams ▪ We monitor datasets for inactivity to flag jobs for sunset ▪ Robust jobs are migrated to Airflow ▪ We’ve reduced the number of jobs by focusing on portfolio-scale data products
  19. Cost Tracking ▪ Issue ▪ We did not have good visibility into which jobs were consuming the most resources ▪ Resolution ▪ We set up tags for tracking team and project utilization ▪ Tagging is automated through our cluster provisioning process
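
One way the tagging on slide 19 can be automated: Databricks cluster specifications accept a `custom_tags` map that is propagated to the underlying cloud instances, so spend can be grouped by team and project in billing reports. A minimal sketch; the tag keys and values here are hypothetical, not Zynga's actual scheme:

```json
{
  "cluster_name": "central-analytics-ephemeral",
  "autoscale": { "min_workers": 2, "max_workers": 8 },
  "custom_tags": {
    "team": "central-analytics",
    "project": "automodel",
    "cost-center": "publishing"
  }
}
```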
  20. Support ▪ Issue ▪ General questions about how PySpark works or why a notebook task is failing ▪ Job failures that are non-trivial to trace ▪ Resolution ▪ Establish SLAs for responding to questions ▪ Empower more of the team to answer questions ▪ Pair up new hires with experienced users ▪ Use cross-team projects to continue developing PySpark knowledge
  21. Applications
  22. Propensity Modeling ▪ AutoModel System ▪ Builds hundreds of propensity models daily ▪ Predicts likelihood of users to lapse in activity or make a purchase ▪ Leverages the Featuretools library to automate feature engineering
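
The automated feature engineering on slide 22 uses Featuretools; its core idea, applying a set of aggregation primitives to an event log to generate many features per player, can be sketched in plain pandas. The event schema and primitive list below are hypothetical, simplified stand-ins:

```python
import pandas as pd

# Hypothetical event log: one row per play session.
events = pd.DataFrame({
    "player_id": [1, 1, 2, 2, 2],
    "spend":     [0.0, 4.99, 1.99, 0.0, 9.99],
    "duration":  [300, 120, 600, 90, 45],
})

# A small set of aggregation primitives applied to every numeric
# column, similar in spirit to Featuretools' deep feature synthesis.
primitives = ["sum", "mean", "max", "count"]
features = events.groupby("player_id").agg(
    {col: primitives for col in ["spend", "duration"]}
)
# Flatten the (column, primitive) MultiIndex into feature names.
features.columns = [f"{col}_{agg}" for col, agg in features.columns]
```

Each player row in `features` then becomes a model input, so adding a primitive or a column multiplies the feature set without hand-written SQL.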
  23. Segmentation ▪ Player Archetypes ▪ We use MLlib to cluster users based on gameplay behavior ▪ Segments can be used for experimentation and marketing
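
Slide 23 uses MLlib's KMeans at scale; the underlying algorithm (Lloyd's iteration) can be shown in a few lines of numpy for intuition. The feature vectors below are hypothetical (sessions/day, spend/day), and this local sketch stands in for `pyspark.ml.clustering.KMeans`:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's k-means; MLlib's KMeans plays this role at scale."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each player vector to its nearest center.
        d = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster empties.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two obvious behavioral archetypes: casual vs. engaged spenders.
X = np.array([[0.5, 0.0], [0.6, 0.1], [5.0, 2.0], [5.5, 2.2]])
labels, centers = kmeans(X, k=2)
```

The resulting cluster labels are what feed experimentation and marketing as player segments.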
  24. Anomaly Detection ▪ Autoencoder for Cheat Detection ▪ Players are represented as 1D images ▪ Players with large vector differences are flagged as suspect
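
The flagging logic on slide 24 is reconstruction error: encode a player vector, decode it, and flag players the model reconstructs poorly. As a lightweight, dependency-free stand-in for the autoencoder, this sketch uses a low-rank SVD reconstruction on synthetic data (the "cheater" row and all feature values are fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical player vectors: most follow a shared low-rank pattern.
normal = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 16))
cheater = rng.normal(size=(1, 16)) * 4.0  # off-pattern vector
X = np.vstack([normal, cheater])

# Low-rank reconstruction via SVD: a stand-in for the autoencoder's
# encode/decode step.
mu = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
k = 2
recon = (X - mu) @ Vt[:k].T @ Vt[:k] + mu
errors = np.linalg.norm(X - recon, axis=1)

# Flag players whose reconstruction error is far above the norm.
threshold = errors.mean() + 3 * errors.std()
suspects = np.where(errors > threshold)[0]
```

A trained autoencoder replaces the SVD step but the "large vector difference means suspect" rule is the same.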
  25. Economy Simulation ▪ Markov Chains in PySpark ▪ We predict the outcome of game updates using millions of simulated playthroughs ▪ Encoding users as Spark dataframes enables scalable simulations
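
The per-user simulation step behind slide 25 can be sketched as a vectorized Markov chain update; in production the state vectors live in Spark dataframes, but the transition logic is the same. The three-state economy and transition probabilities below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical 3-state player economy: 0 = earning, 1 = spending, 2 = lapsed.
P = np.array([
    [0.7, 0.2, 0.1],   # transitions from "earning"
    [0.5, 0.4, 0.1],   # transitions from "spending"
    [0.0, 0.0, 1.0],   # "lapsed" is absorbing
])

def simulate(states, steps):
    """Advance many simulated players one Markov step at a time."""
    for _ in range(steps):
        u = rng.random(len(states))
        # Inverse-CDF sampling: first state whose cumulative prob exceeds u.
        cdf = P[states].cumsum(axis=1)
        states = (u[:, None] < cdf).argmax(axis=1)
    return states

start = np.zeros(10_000, dtype=int)       # everyone starts in "earning"
final = simulate(start, steps=20)
lapse_rate = (final == 2).mean()
```

Changing `P` to reflect a proposed game update and re-running the simulation is how the outcome of the update is predicted before shipping.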
  26. Experimentation ▪ Significance Testing at Scale ▪ Pandas UDFs for divide and conquer ▪ Distributed SciPy, NumPy, and StatsModels
  27. Reinforcement Learning ▪ Personalization Pipeline ▪ Real-time model serving ▪ In Production ▪ Words with Friends 2 ▪ CSR Racing 2 ▪ Open Source ▪ RL Bakery ▪ https://github.com/zynga/rl-bakery
  28. Conclusion
  29. Takeaways ▪ We are encouraging our entire analytics organization to use PySpark ▪ Driving adoption may require training and hands-on support ▪ It’s important to put policies in place for maintainability and cost ▪ Opening up PySpark to more teams has resulted in useful applications
  30. Democratizing PySpark at Zynga. Ben Weber, @bgweber. We are hiring! Zynga.com/jobs
  31. Feedback ▪ Your feedback is important to us. Don’t forget to rate and review the sessions.
