Ben Weber, Zynga
Automating Predictive Modeling
at Zynga with Pandas UDFs
#UnifiedAnalytics #SparkAISummit
Zynga Analytics
Zynga Portfolio
Our Challenge
• We want to build game-specific models for
behaviors such as likelihood to purchase
• Our games have diverse event taxonomies
• We have tens of millions of players and
dozens of games across multiple platforms
Our Approach
• Featuretools for automating feature engineering
• Pandas UDFs for distributing Featuretools
• Databricks for building our model pipeline
AutoModel
• Zynga’s first portfolio-scale data product
• Generates hundreds of propensity models
• Powers features in our games & live services
AutoModel Pipeline
Data Extract -> Feature Engineering -> Feature Application -> Model Training -> Model Publish
Data Extraction
S3 & Parquet
Feature Engineering
Automated Feature Engineering
• Goals
– Translate our narrow and deep data tables into a shallow
and wide representation
– Support dozens of titles with diverse event taxonomies
– Scale to billions of records and millions of players
– Minimize manual data science workflows
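The first goal, turning a narrow and deep event table into a shallow and wide one, can be illustrated with plain pandas. This is only a minimal sketch on made-up data: the column names (user_id, event_type) and the count_ prefix are assumptions for illustration and are not taken from Zynga's pipeline.

```python
import pandas as pd

# Hypothetical narrow and deep event log: one row per (user, event).
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "event_type": ["login", "purchase", "login", "login", "level_up", "purchase"],
})

# Shallow and wide: one row per user, one count column per event type.
wide = (
    pd.crosstab(events["user_id"], events["event_type"])
    .add_prefix("count_")
    .reset_index()
)
wide.columns.name = None  # drop the leftover "event_type" axis label

print(wide)
```

Each distinct event type becomes its own column, so games with different event taxonomies produce differently shaped wide tables from the same code.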
Featuretools
• A Python library for deep feature synthesis
• Represents data as entity sets
• Identifies feature descriptors for transforming
your data into new representations
Entity Sets
Entityset: transactions
  Entities:
    customers (shape = [5, 3])
    transactions (shape = [500, 5])
  Relationships:
    transactions.customer_id -> customers.customer_id
• Define the relationships between tables
• Work with Pandas data frames
Feature Synthesis
import featuretools as ft
feature_matrix, features_defs = ft.dfs(entityset=es, target_entity="customers")
feature_matrix.head(5)

customer_id  zip_code  count(transactions)  sum(transactions.amounts)
1            91000     0                    0
2            91000     10                   120.5
3            91005     5                    17.96
4            91005     2                    9.99
5            91000     3                    29.97
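To make the count and sum columns concrete, here is a hand-written pandas sketch that computes the same kind of aggregation features for a parent table from its child table. It is not how Featuretools is implemented, and the sample rows and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical parent and child tables.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "zip_code": ["91000", "91000", "91005"],
})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12, 13],
    "customer_id": [2, 2, 3, 2],
    "amount": [20.0, 0.5, 17.96, 100.0],
})

# Aggregate the child table per parent key.
agg = (
    transactions.groupby("customer_id")["amount"]
    .agg(["count", "sum"])
    .rename(columns={"count": "count(transactions)",
                     "sum": "sum(transactions.amount)"})
)

# Join onto the parent table; customers with no transactions get zeros.
feature_matrix = (
    customers.set_index("customer_id")
    .join(agg)
    .fillna({"count(transactions)": 0, "sum(transactions.amount)": 0.0})
)

print(feature_matrix)
```

Deep feature synthesis generates many such features automatically by stacking aggregations and transformations along the relationships in the entity set.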
Using Featuretools
import featuretools as ft
# 1-hot encode the raw event data
es = ft.EntitySet(id="events")
es = es.entity_from_dataframe(entity_id="events", dataframe=rawDataDF)
feature_matrix, defs = ft.dfs(entityset=es, target_entity="events", max_depth=1)
encodedDF, encoders = ft.encode_features(feature_matrix, defs)
# perform deep feature synthesis on the encoded data
es = ft.EntitySet(id="events")
es = es.entity_from_dataframe(entity_id="events", dataframe=encodedDF)
es = es.normalize_entity(base_entity_id="events", new_entity_id="users", index="user_id")
generated_features, descriptors = ft.dfs(entityset=es, target_entity="users", max_depth=3)
Scaling Up
• Parallelize the process
• Translate feature descriptions to Spark SQL
• Find a way to distribute the task
Feature Application
Pandas UDFs
• Introduced in Spark 2.3
• Provide Scalar and Grouped map operations
• Partitioned using a groupby clause
• Enable distributing code that uses Pandas
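The grouped map pattern can be tried locally before involving Spark. The sketch below emulates it with pandas alone: split by key, call a function on each group's data frame, and concatenate the results. The helper grouped_map and the sample columns are hypothetical; in Spark the same function body would run on the executors, one group at a time.

```python
import pandas as pd

def grouped_map(df, key, func):
    """Local stand-in for a grouped map: split by key, apply func, combine."""
    outputs = [func(group.reset_index(drop=True)) for _, group in df.groupby(key)]
    return pd.concat(outputs, ignore_index=True)

def summarize_player(player_pd):
    # Receives every row for one player as a pandas data frame
    # and returns a (here single-row) pandas data frame.
    return pd.DataFrame({
        "player_id": [player_pd["player_id"][0]],
        "sessions": [len(player_pd)],
        "mean_score": [player_pd["score"].mean()],
    })

data = pd.DataFrame({
    "player_id": [1, 1, 2, 2, 2],
    "score": [10.0, 20.0, 5.0, 7.0, 9.0],
})

result = grouped_map(data, "player_id", summarize_player)
print(result)
```

Testing the per-group function this way on a small sample is one way to soften the debugging difficulties that come with running it inside Spark.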
Grouped Map UDFs
[Figure: the Spark input is split into groups; each group is passed to the UDF as a Pandas input and returns a Pandas output; the outputs are combined into the Spark output.]
When to use UDFs?
• You need to operate on Pandas data frames
• Your data can be represented as a single Spark
data frame
• You can partition your data set
Distributing SciPy
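import pandas as pd
from scipy.optimize import leastsq
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

# output schema: one row of fitted coefficients per player
schema = StructType([StructField('ID', LongType(), True),
                     StructField('b0', DoubleType(), True),
                     StructField('b1', DoubleType(), True)])

# fit (the residual function passed to leastsq) is defined elsewhere
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def analyze_player(player_pd):
    result = leastsq(fit, [1, 0], args=(player_pd.shots, player_pd.hits))
    return pd.DataFrame({'ID': [player_pd.player_id[0]],
                         'b0': result[0][0], 'b1': result[0][1]})

result_spark_df = spark_df.groupby('player_id').apply(analyze_player)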
Step 1: Define the schema
schema = StructType([StructField('ID', LongType(), True),
                     StructField('b0', DoubleType(), True),
                     StructField('b1', DoubleType(), True)])
Step 2: Choose a partition
result_spark_df = spark_df.groupby('player_id').apply(analyze_player)
Step 3: Use Pandas
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def analyze_player(player_pd):
    result = leastsq(fit, [1, 0], args=(player_pd.shots, player_pd.hits))
Step 4: Return Pandas
# inside analyze_player: return one row per player that matches the schema
return pd.DataFrame({'ID': [player_pd.player_id[0]],
                     'b0': result[0][0], 'b1': result[0][1]})
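The slides never show the fit function passed to leastsq. One plausible shape, assuming a straight-line model hits = b0 + b1 * shots, is sketched below and run on a single player's data without Spark. The model, the sample values, and the name analyze_player_local are assumptions for illustration.

```python
import pandas as pd
from scipy.optimize import leastsq

def fit(params, x, y):
    """Residuals of an assumed straight-line model y = b0 + b1 * x."""
    b0, b1 = params
    return y - (b0 + b1 * x)

def analyze_player_local(player_pd):
    # Same body as the UDF, minus the Spark decorator.
    result = leastsq(fit, [1.0, 0.0],
                     args=(player_pd.shots.to_numpy(), player_pd.hits.to_numpy()))
    b0, b1 = result[0]  # result[0] holds the fitted parameters
    return pd.DataFrame({"ID": [player_pd.player_id.iloc[0]],
                         "b0": [b0], "b1": [b1]})

# Hypothetical data for one player, generated from hits = 0.5 + 2 * shots.
player_pd = pd.DataFrame({
    "player_id": [7, 7, 7, 7],
    "shots": [1.0, 2.0, 3.0, 4.0],
    "hits": [2.5, 4.5, 6.5, 8.5],
})

out = analyze_player_local(player_pd)
print(out)
```

leastsq returns a tuple whose first element is the parameter array, which is why the two coefficients are read from result[0][0] and result[0][1].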
Distributing Featuretools
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def apply_feature_generation(pandasInputDF):
    # create Entity Set representation
    es = ft.EntitySet(id="events")
    es = es.entity_from_dataframe(entity_id="events", dataframe=pandasInputDF)
    es = es.normalize_entity(base_entity_id="events", new_entity_id="users", index="user_id")
    # apply the feature calculation and return the result
    return ft.calculate_feature_matrix(saved_features, es)

sparkFeatureDF = sparkInputDF.groupby('user_group').apply(apply_feature_generation)
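The slides do not say how the user_group column is produced. One common option, shown here purely as an assumption, is to hash each user ID into a fixed number of buckets so that every user lands in exactly one group and groups are roughly equal in size. The function name and bucket count are hypothetical.

```python
import zlib

def assign_user_group(user_id, n_groups=100):
    """Map a user ID to a stable bucket in the range [0, n_groups)."""
    return zlib.crc32(str(user_id).encode("utf-8")) % n_groups

# Hypothetical user IDs; all events for one user share the same bucket.
user_ids = [101, 102, 103, 104, 105, 106]
groups = {uid: assign_user_group(uid, n_groups=3) for uid in user_ids}
print(groups)
```

Grouping by bucket rather than by individual user keeps all of a user's events together while giving each UDF call a batch of users to process, and the bucket count controls how large each Pandas data frame gets.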
Issues with Pandas UDFs
• Debugging is a challenge
• Pushes the limits of Apache Arrow
• Data type mismatches
• Schema needs to be known before execution
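Because the output schema is fixed before execution, a group that yields a different set of columns or dtypes can fail at runtime. A defensive step, sketched here with pandas only, is to force every returned data frame into the declared columns and types before handing it back. OUTPUT_SCHEMA, conform_to_schema, and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical fixed output schema: column name -> pandas dtype.
OUTPUT_SCHEMA = {
    "user_id": "int64",
    "count_login": "float64",
    "count_purchase": "float64",
}

def conform_to_schema(df, schema=OUTPUT_SCHEMA):
    """Return df with exactly the schema's columns, in order, cast to its dtypes."""
    # reindex adds missing columns as NaN and drops unexpected ones
    out = df.reindex(columns=list(schema))
    for column, dtype in schema.items():
        if dtype.startswith("float"):
            out[column] = out[column].fillna(0.0)
        out[column] = out[column].astype(dtype)
    return out

# A group that produced an extra column and is missing count_purchase.
partial = pd.DataFrame({
    "user_id": [1, 2],
    "count_login": [3, 1],
    "unexpected": ["a", "b"],
})

fixed = conform_to_schema(partial)
print(fixed.dtypes)
```

Calling a helper like this as the last line of the UDF keeps the returned columns aligned with the declared schema even when a group is missing some event types.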
Model Training & Scoring
MLlib
Propensity Models
• Classification models
– Gradient-Boosted Trees
– XGBoost
• Hyperparameter tuning
– ParamGridBuilder
– CrossValidator
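Conceptually, a parameter grid is the Cartesian product of the candidate values for each hyperparameter, and cross-validation then scores every combination. The plain-Python sketch below shows only the grid expansion; it is not MLlib code, and the parameter names and values are illustrative assumptions.

```python
from itertools import product

def build_param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]

# Hypothetical search space for a gradient-boosted tree classifier.
param_grid = build_param_grid({"maxDepth": [3, 5], "maxIter": [10, 20, 50]})

for params in param_grid:
    print(params)
```

The grid size is the product of the list lengths, so with k-fold cross-validation the number of model fits is that size multiplied by k.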
Model Application
Couchbase
Productizing with Databricks
[Figure: a Driver Notebook with a thread pool launches Model Notebooks for Game 1, Game 2, and Game 3 through the Jobs API; scores are then published.]
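The driver-plus-thread-pool arrangement named on this slide can be sketched with the Python standard library. The function run_model_notebook is a hypothetical stand-in for whatever call actually launches a game's model notebook as a job; the slides do not show that call.

```python
from concurrent.futures import ThreadPoolExecutor

def run_model_notebook(game):
    """Hypothetical stand-in for launching one game's model notebook as a job."""
    # A real driver would block here until the job for this game finishes.
    return {"game": game, "status": "ok"}

def run_all(games, max_workers=4):
    # Threads suit this workload: each one mostly waits on a remote job.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_model_notebook, games))

results = run_all(["Game 1", "Game 2", "Game 3"])
for result in results:
    print(result)
```

pool.map returns results in the same order as the input list, so the driver can match each outcome back to its game before publishing scores.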
Pandas UDFs at Zynga
• AutoModel
– Featuretools
• Experimentation
– StatsModels
– SciPy
– NumPy
Machine Learning at Zynga
Old Approach
• Custom data science and
engineering work per model
• Months-long development process
• Ad-hoc process for productizing
models
New Approach
• Minimal effort for building new
propensity models
• No custom work for new games
• Predictions are deployed to
our application database
Takeaways
• Pandas UDFs unlock a new magnitude of
processing for Python libraries
• Zynga is using PySpark to build portfolio-scale
data products
Questions?
• We are hiring! Zynga.com/jobs
Ben Weber Zynga Analytics @bgweber
35#UnifiedAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

More Related Content

What's hot

Linear models for classification
Linear models for classificationLinear models for classification
Linear models for classification
Sung Yub Kim
 
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
Edureka!
 
Demystifying Xgboost
Demystifying XgboostDemystifying Xgboost
Demystifying Xgboost
halifaxchester
 
Matrix factorization
Matrix factorizationMatrix factorization
Matrix factorization
Luis Serrano
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
Sri Ambati
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data Analysis
Andrew Henshaw
 
Python matplotlib cheat_sheet
Python matplotlib cheat_sheetPython matplotlib cheat_sheet
Python matplotlib cheat_sheet
Nishant Upadhyay
 
XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ Fyber
Daniel Hen
 
Machine Learning In Python | Python Machine Learning Tutorial | Deep Learning...
Machine Learning In Python | Python Machine Learning Tutorial | Deep Learning...Machine Learning In Python | Python Machine Learning Tutorial | Deep Learning...
Machine Learning In Python | Python Machine Learning Tutorial | Deep Learning...
Edureka!
 
Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...
Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...
Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...
Amazon Web Services
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
Gabriel Moreira
 
Xgboost
XgboostXgboost
Machine Learning lecture6(regularization)
Machine Learning lecture6(regularization)Machine Learning lecture6(regularization)
Machine Learning lecture6(regularization)
cairo university
 
KNN
KNN KNN
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
Md. Main Uddin Rony
 
High Dimensional Data Visualization using t-SNE
High Dimensional Data Visualization using t-SNEHigh Dimensional Data Visualization using t-SNE
High Dimensional Data Visualization using t-SNE
Kai-Wen Zhao
 
BigQuery MLの行列分解モデルを 用いた推薦システムの基礎
BigQuery MLの行列分解モデルを 用いた推薦システムの基礎BigQuery MLの行列分解モデルを 用いた推薦システムの基礎
BigQuery MLの行列分解モデルを 用いた推薦システムの基礎
幸太朗 岩澤
 
Wasserstein GAN 수학 이해하기 I
Wasserstein GAN 수학 이해하기 IWasserstein GAN 수학 이해하기 I
Wasserstein GAN 수학 이해하기 I
Sungbin Lim
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade off
VARUN KUMAR
 
Ml10 dimensionality reduction-and_advanced_topics
Ml10 dimensionality reduction-and_advanced_topicsMl10 dimensionality reduction-and_advanced_topics
Ml10 dimensionality reduction-and_advanced_topics
ankit_ppt
 

What's hot (20)

Linear models for classification
Linear models for classificationLinear models for classification
Linear models for classification
 
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
Python for Data Science | Python Data Science Tutorial | Data Science Certifi...
 
Demystifying Xgboost
Demystifying XgboostDemystifying Xgboost
Demystifying Xgboost
 
Matrix factorization
Matrix factorizationMatrix factorization
Matrix factorization
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data Analysis
 
Python matplotlib cheat_sheet
Python matplotlib cheat_sheetPython matplotlib cheat_sheet
Python matplotlib cheat_sheet
 
XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ Fyber
 
Machine Learning In Python | Python Machine Learning Tutorial | Deep Learning...
Machine Learning In Python | Python Machine Learning Tutorial | Deep Learning...Machine Learning In Python | Python Machine Learning Tutorial | Deep Learning...
Machine Learning In Python | Python Machine Learning Tutorial | Deep Learning...
 
Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...
Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...
Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
Xgboost
XgboostXgboost
Xgboost
 
Machine Learning lecture6(regularization)
Machine Learning lecture6(regularization)Machine Learning lecture6(regularization)
Machine Learning lecture6(regularization)
 
KNN
KNN KNN
KNN
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
High Dimensional Data Visualization using t-SNE
High Dimensional Data Visualization using t-SNEHigh Dimensional Data Visualization using t-SNE
High Dimensional Data Visualization using t-SNE
 
BigQuery MLの行列分解モデルを 用いた推薦システムの基礎
BigQuery MLの行列分解モデルを 用いた推薦システムの基礎BigQuery MLの行列分解モデルを 用いた推薦システムの基礎
BigQuery MLの行列分解モデルを 用いた推薦システムの基礎
 
Wasserstein GAN 수학 이해하기 I
Wasserstein GAN 수학 이해하기 IWasserstein GAN 수학 이해하기 I
Wasserstein GAN 수학 이해하기 I
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade off
 
Ml10 dimensionality reduction-and_advanced_topics
Ml10 dimensionality reduction-and_advanced_topicsMl10 dimensionality reduction-and_advanced_topics
Ml10 dimensionality reduction-and_advanced_topics
 

Similar to Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs

Strategies for refactoring and migrating a big old project to be multilingual...
Strategies for refactoring and migrating a big old project to be multilingual...Strategies for refactoring and migrating a big old project to be multilingual...
Strategies for refactoring and migrating a big old project to be multilingual...
benjaoming
 
Elasticsearch first-steps
Elasticsearch first-stepsElasticsearch first-steps
Elasticsearch first-steps
Matteo Moci
 
Relevance trilogy may dream be with you! (dec17)
Relevance trilogy  may dream be with you! (dec17)Relevance trilogy  may dream be with you! (dec17)
Relevance trilogy may dream be with you! (dec17)
Woonsan Ko
 
When you need more data in less time...
When you need more data in less time...When you need more data in less time...
When you need more data in less time...
Bálint Horváth
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
Miguel González-Fierro
 
Ember.js Tokyo event 2014/09/22 (English)
Ember.js Tokyo event 2014/09/22 (English)Ember.js Tokyo event 2014/09/22 (English)
Ember.js Tokyo event 2014/09/22 (English)
Yuki Shimada
 
Educate 2017: Customizing Assessments: Why extending the APIs is easier than ...
Educate 2017: Customizing Assessments: Why extending the APIs is easier than ...Educate 2017: Customizing Assessments: Why extending the APIs is easier than ...
Educate 2017: Customizing Assessments: Why extending the APIs is easier than ...
Learnosity
 
Taking Web Apps Offline
Taking Web Apps OfflineTaking Web Apps Offline
Taking Web Apps OfflinePedro Morais
 
Big Objects in Salesforce
Big Objects in SalesforceBig Objects in Salesforce
Big Objects in Salesforce
Amit Chaudhary
 
Drupal 8. Search API. Facets. Customize / combine facets
Drupal 8. Search API. Facets. Customize / combine facetsDrupal 8. Search API. Facets. Customize / combine facets
Drupal 8. Search API. Facets. Customize / combine facets
AnyforSoft
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Chester Chen
 
A miało być tak... bez wycieków
A miało być tak... bez wyciekówA miało być tak... bez wycieków
A miało być tak... bez wycieków
Konrad Kokosa
 
Ibis: Seamless Transition Between Pandas and Apache Spark
Ibis: Seamless Transition Between Pandas and Apache SparkIbis: Seamless Transition Between Pandas and Apache Spark
Ibis: Seamless Transition Between Pandas and Apache Spark
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Databricks
 
Spark AI 2020
Spark AI 2020Spark AI 2020
Spark AI 2020
Li Jin
 
Introduction to Swagger
Introduction to SwaggerIntroduction to Swagger
Introduction to Swagger
Knoldus Inc.
 
Magento Indexes
Magento IndexesMagento Indexes
Magento Indexes
Ivan Chepurnyi
 
SFScon17 - Patrick Puecher: "Exploring data with Elasticsearch and Kibana"
SFScon17 - Patrick Puecher: "Exploring data with Elasticsearch and Kibana"SFScon17 - Patrick Puecher: "Exploring data with Elasticsearch and Kibana"
SFScon17 - Patrick Puecher: "Exploring data with Elasticsearch and Kibana"
South Tyrol Free Software Conference
 
What's Coming Next in Sencha Frameworks
What's Coming Next in Sencha FrameworksWhat's Coming Next in Sencha Frameworks
What's Coming Next in Sencha Frameworks
Grgur Grisogono
 

Similar to Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs (20)

Strategies for refactoring and migrating a big old project to be multilingual...
Strategies for refactoring and migrating a big old project to be multilingual...Strategies for refactoring and migrating a big old project to be multilingual...
Strategies for refactoring and migrating a big old project to be multilingual...
 
Elasticsearch first-steps
Elasticsearch first-stepsElasticsearch first-steps
Elasticsearch first-steps
 
Relevance trilogy may dream be with you! (dec17)
Relevance trilogy  may dream be with you! (dec17)Relevance trilogy  may dream be with you! (dec17)
Relevance trilogy may dream be with you! (dec17)
 
When you need more data in less time...
When you need more data in less time...When you need more data in less time...
When you need more data in less time...
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
 
Ember.js Tokyo event 2014/09/22 (English)
Ember.js Tokyo event 2014/09/22 (English)Ember.js Tokyo event 2014/09/22 (English)
Ember.js Tokyo event 2014/09/22 (English)
 
Educate 2017: Customizing Assessments: Why extending the APIs is easier than ...
Educate 2017: Customizing Assessments: Why extending the APIs is easier than ...Educate 2017: Customizing Assessments: Why extending the APIs is easier than ...
Educate 2017: Customizing Assessments: Why extending the APIs is easier than ...
 
Taking Web Apps Offline
Taking Web Apps OfflineTaking Web Apps Offline
Taking Web Apps Offline
 
Big Objects in Salesforce
Big Objects in SalesforceBig Objects in Salesforce
Big Objects in Salesforce
 
Drupal 8. Search API. Facets. Customize / combine facets
Drupal 8. Search API. Facets. Customize / combine facetsDrupal 8. Search API. Facets. Customize / combine facets
Drupal 8. Search API. Facets. Customize / combine facets
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
A miało być tak... bez wycieków
A miało być tak... bez wyciekówA miało być tak... bez wycieków
A miało być tak... bez wycieków
 
Ibis: Seamless Transition Between Pandas and Apache Spark
Ibis: Seamless Transition Between Pandas and Apache SparkIbis: Seamless Transition Between Pandas and Apache Spark
Ibis: Seamless Transition Between Pandas and Apache Spark
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
 
Spark AI 2020
Spark AI 2020Spark AI 2020
Spark AI 2020
 
Introduction to Swagger
Introduction to SwaggerIntroduction to Swagger
Introduction to Swagger
 
Magento Indexes
Magento IndexesMagento Indexes
Magento Indexes
 
SFScon17 - Patrick Puecher: "Exploring data with Elasticsearch and Kibana"
SFScon17 - Patrick Puecher: "Exploring data with Elasticsearch and Kibana"SFScon17 - Patrick Puecher: "Exploring data with Elasticsearch and Kibana"
SFScon17 - Patrick Puecher: "Exploring data with Elasticsearch and Kibana"
 
What's Coming Next in Sencha Frameworks
What's Coming Next in Sencha FrameworksWhat's Coming Next in Sencha Frameworks
What's Coming Next in Sencha Frameworks
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration



Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs

  • 2. Ben Weber, Zynga Automating Predictive Modeling at Zynga with Pandas UDFs #UnifiedAnalytics #SparkAISummit
  • 5. Our Challenge
      • We want to build game-specific models for behaviors such as likelihood to purchase
      • Our games have diverse event taxonomies
      • We have tens of millions of players and dozens of games across multiple platforms
  • 6. Our Approach
      • Featuretools for automating feature engineering
      • Pandas UDFs for distributing Featuretools
      • Databricks for building our model pipeline
  • 7. AutoModel
      • Zynga’s first portfolio-scale data product
      • Generates hundreds of propensity models
      • Powers features in our games & live services
  • 11. Automated Feature Engineering
      • Goals
          – Translate our narrow and deep data tables into a shallow and wide representation
          – Support dozens of titles with diverse event taxonomies
          – Scale to billions of records and millions of players
          – Minimize manual data science workflows
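The first goal on slide 11, turning a narrow and deep event table into a shallow and wide one, can be illustrated with a minimal, dependency-free sketch. The event tuples and the `widen` helper are hypothetical and are not taken from the talk; Featuretools automates a far richer version of this kind of aggregation.

```python
from collections import defaultdict

# Hypothetical narrow/deep event log: one row per (user_id, event_type, value).
events = [
    (1, "purchase", 4.99),
    (1, "level_up", 1.0),
    (2, "purchase", 9.99),
    (1, "purchase", 0.99),
]

def widen(rows):
    """Aggregate one-row-per-event data into one-row-per-user features."""
    wide = defaultdict(lambda: defaultdict(float))
    for user_id, event_type, value in rows:
        wide[user_id][f"count({event_type})"] += 1
        wide[user_id][f"sum({event_type})"] += value
    # Convert the nested defaultdicts into plain dicts for easier inspection.
    return {user: dict(features) for user, features in wide.items()}

feature_rows = widen(events)
```

Each user ends up with a single row whose columns grow with the number of distinct event types, which is why diverse event taxonomies make manual feature work expensive.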
  • 12. Featuretools
      • A Python library for deep feature synthesis
      • Represents data as entity sets
      • Identifies feature descriptors for transforming your data into new representations
  • 13. Entity Sets
      Entityset: transactions
        Entities:
          customers (shape = [5, 3])
          transactions (shape = [500, 5])
        Relationships:
          transactions.customer_id -> customers.customer_id
      • Define the relationships between tables
      • Work with Pandas data frames
  • 14. Feature Synthesis
      import featuretools as ft

      feature_matrix, features_defs = ft.dfs(entityset=es, target_entity="customers")
      feature_matrix.head(5)

      customer_id  zip_code  count(transactions)  sum(transactions.amounts)
      1            91000     0                    0
      2            91000     10                   120.5
      3            91005     5                    17.96
      4            91005     2                    9.99
      5            91000     3                    29.97
  • 15. Using Featuretools
      import featuretools as ft

      # 1-hot encode the raw event data
      es = ft.EntitySet(id="events")
      es = es.entity_from_dataframe(entity_id="events", dataframe=rawDataDF)
      feature_matrix, defs = ft.dfs(entityset=es, target_entity="events", max_depth=1)
      encodedDF, encoders = ft.encode_features(feature_matrix, defs)

      # perform deep feature synthesis on the encoded data
      es = ft.EntitySet(id="events")
      es = es.entity_from_dataframe(entity_id="events", dataframe=encodedDF)
      es = es.normalize_entity(base_entity_id="events", new_entity_id="users", index="user_id")
      generated_features, descriptors = ft.dfs(entityset=es, target_entity="users", max_depth=3)
  • 16. Scaling Up
      • Parallelize the process
      • Translate feature descriptions to Spark SQL
      • Find a way to distribute the task
  • 18. Pandas UDFs
      • Introduced in Spark 2.3
      • Provide Scalar and Grouped map operations
      • Partitioned using a groupby clause
      • Enable distributing code that uses Pandas
  • 19. Grouped Map UDFs
      (Diagram: Spark Input -> Pandas Input -> UDF -> Pandas Output -> Spark Output, with the Pandas Input -> UDF -> Pandas Output step repeated once per group)
  • 20. When to use UDFs?
      • You need to operate on Pandas data frames
      • Your data can be represented as a single Spark data frame
      • You can partition your data set
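The grouped map pattern on slides 18 to 20 is a split-apply-combine: partition the rows by a key, run a function on each partition independently, and concatenate the outputs. The sketch below imitates that flow locally with plain Python lists of dicts standing in for data frames. The `grouped_map` and `summarize` helpers and the sample rows are hypothetical, and nothing here calls Spark.

```python
from itertools import groupby
from operator import itemgetter

def grouped_map(rows, key, func):
    """Split rows by key, apply func to each group, and concatenate the results."""
    ordered = sorted(rows, key=itemgetter(key))
    output = []
    for _, group in groupby(ordered, key=itemgetter(key)):
        # Each group is processed independently of the others, which is
        # the property that lets a cluster run groups on different workers.
        output.extend(func(list(group)))
    return output

def summarize(group):
    """Hypothetical per-group function: one output row per input group."""
    return [{"player_id": group[0]["player_id"],
             "total_hits": sum(r["hits"] for r in group)}]

rows = [
    {"player_id": 1, "hits": 3},
    {"player_id": 2, "hits": 5},
    {"player_id": 1, "hits": 4},
]
result = grouped_map(rows, "player_id", summarize)
```

Running the per-group function locally like this on a small sample is also one way to exercise its logic before wrapping it in a UDF.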
  • 21. Distributing SciPy
      import pandas as pd
      from pyspark.sql.functions import pandas_udf, PandasUDFType
      from pyspark.sql.types import StructType, StructField, LongType, DoubleType
      from scipy.optimize import leastsq

      # output schema of the UDF: one row per player
      schema = StructType([StructField('ID', LongType(), True),
                           StructField('b0', DoubleType(), True),
                           StructField('b1', DoubleType(), True)])

      @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
      def analyze_player(player_pd):
          # fit is the residual function passed to leastsq (not shown on the slide)
          result = leastsq(fit, [1, 0], args=(player_pd.shots, player_pd.hits))
          # result[0] holds the fitted coefficients [b0, b1]
          return pd.DataFrame({'ID': [player_pd.player_id.iloc[0]],
                               'b0': result[0][0],
                               'b1': result[0][1]})

      result_spark_df = spark_df.groupby('player_id').apply(analyze_player)
  • 22. Step 1: Define the schema
      schema = StructType([StructField('ID', LongType(), True),
                           StructField('b0', DoubleType(), True),
                           StructField('b1', DoubleType(), True)])
  • 23. Step 2: Choose a partition
      result_spark_df = spark_df.groupby('player_id').apply(analyze_player)
  • 24. Step 3: Use Pandas
      result = leastsq(fit, [1, 0], args=(player_pd.shots, player_pd.hits))
  • 25. Step 4: Return Pandas
      return pd.DataFrame({'ID': [player_pd.player_id.iloc[0]],
                           'b0': result[0][0],
                           'b1': result[0][1]})
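Slides 21 to 25 call `leastsq(fit, [1, 0], ...)` without showing `fit`. A plausible reading, assumed here and not confirmed by the slides, is that `fit` returns the residuals of the line `hits = b0 + b1 * shots`. The sketch below writes that hypothetical residual function and, as a dependency-free stand-in for SciPy's iterative solver, computes the same coefficients with the closed-form ordinary least squares formulas. The sample data is invented.

```python
def fit(params, x, y):
    """Hypothetical residuals for the line y = b0 + b1 * x."""
    b0, b1 = params
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def linear_coefficients(x, y):
    """Closed-form ordinary least squares for the same linear model."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Invented data for one player: hits = -1 + 0.5 * shots exactly.
shots = [10.0, 20.0, 30.0]
hits = [4.0, 9.0, 14.0]
b0, b1 = linear_coefficients(shots, hits)
residuals = fit((b0, b1), shots, hits)
```

For a model that is linear in its parameters, minimizing the squared residuals has this closed form, so the two approaches should agree up to numerical tolerance; `leastsq` becomes necessary once the model is nonlinear.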
  • 26. Distributing Featuretools
      # schema (the output StructType) and saved_features (the feature
      # definitions to compute) are defined before this point and are not shown on the slide
      @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
      def apply_feature_generation(pandasInputDF):
          # create Entity Set representation
          es = ft.EntitySet(id="events")
          es = es.entity_from_dataframe(entity_id="events", dataframe=pandasInputDF)
          es = es.normalize_entity(base_entity_id="events", new_entity_id="users", index="user_id")

          # apply the feature calculation and return the result
          return ft.calculate_feature_matrix(saved_features, es)

      sparkFeatureDF = sparkInputDF.groupby('user_group').apply(apply_feature_generation)
  • 27. Issues with Pandas UDFs
      • Debugging is a challenge
      • Pushes the limits of Apache Arrow
      • Data type mismatches
      • Schema needs to be known before execution
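Two of the issues on slide 27, data type mismatches and a schema that must be declared up front, can be caught earlier by checking a UDF's output locally before it is handed back to Spark. The sketch below is one hypothetical way to do that with plain Python types. `EXPECTED_SCHEMA` and `schema_problems` are illustrative names, not part of PySpark or of the talk.

```python
# Hypothetical declared schema: (column name, expected Python type).
EXPECTED_SCHEMA = [("ID", int), ("b0", float), ("b1", float)]

def schema_problems(row, expected=EXPECTED_SCHEMA):
    """Return a list of human-readable mismatches between a row and the schema."""
    problems = []
    expected_names = [name for name, _ in expected]
    for name in row:
        if name not in expected_names:
            problems.append(f"unexpected column: {name}")
    for name, expected_type in expected:
        if name not in row:
            problems.append(f"missing column: {name}")
        elif not isinstance(row[name], expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(row[name]).__name__}"
            )
    return problems

good = schema_problems({"ID": 7, "b0": 0.5, "b1": 1.25})
bad = schema_problems({"ID": 7, "b0": 1, "extra": "x"})
```

A check like this reports every mismatch at once with the column name attached, which tends to be easier to act on than a serialization error raised on a worker.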
  • 28. Model Training & Scoring
      (Pipeline diagram: Data Extract -> Feature Engineering -> Feature Application -> Model Training -> Model Publish; this stage uses MLlib)
  • 29. Propensity Models
      • Classification models
          – Gradient-Boosted Trees
          – XGBoost
      • Hyperparameter tuning
          – ParamGridBuilder
          – CrossValidator
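Slide 29 names `ParamGridBuilder` and `CrossValidator` for hyperparameter tuning. Conceptually, a parameter grid is the Cartesian product of the candidate values for each parameter, and cross-validation then scores every combination. The sketch below shows only the grid expansion in plain Python. It is not Spark's implementation, `expand_grid` is a hypothetical helper, and the candidate values are invented.

```python
from itertools import product

def expand_grid(grid):
    """Return every combination of the candidate values in grid."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[n] for n in names))]

# Invented candidate values for a gradient-boosted tree classifier.
param_grid = expand_grid({"maxDepth": [3, 5], "maxIter": [10, 20, 50]})
```

The grid size is the product of the list lengths (here 2 x 3 = 6), and each combination is trained once per fold, so the number of fits grows quickly when the same tuning runs for every game.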
  • 31. Productizing with Databricks
      (Diagram: a Driver Notebook uses a thread pool and the Jobs API to run one Model Notebook per game (Game 1, Game 2, Game 3), then publishes the scores)
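The driver pattern on slide 31, one notebook run per game dispatched from a thread pool, can be sketched with the standard library. `run_model_notebook` below is a hypothetical stand-in that returns a fake result; in a real driver it would submit a notebook run and wait for it, and the slides do not show how that call is made.

```python
from concurrent.futures import ThreadPoolExecutor

def run_model_notebook(game):
    """Hypothetical stand-in for launching one game's model notebook."""
    # A real driver would submit a notebook run here and wait for it to finish.
    return {"game": game, "status": "scored"}

def run_all(games, max_workers=3):
    """Run one notebook task per game concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input games.
        return list(pool.map(run_model_notebook, games))

results = run_all(["game_1", "game_2", "game_3"])
```

Threads suit this kind of driver because each task mostly waits on a remote job, so the driver itself does very little computation.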
  • 32. Pandas UDFs at Zynga
      • AutoModel
          – Featuretools
      • Experimentation
          – StatsModels
          – SciPy
          – NumPy
  • 33. Machine Learning at Zynga
      Old Approach
      • Custom data science and engineering work per model
      • Months-long development process
      • Ad-hoc process for productizing models
      New Approach
      • Minimal effort for building new propensity models
      • No custom work for new games
      • Predictions are deployed to our application database
  • 34. Takeaways
      • Pandas UDFs unlock a new magnitude of processing for Python libraries
      • Zynga is using PySpark to build portfolio-scale data products
  • 35. Questions?
      • We are hiring! Zynga.com/jobs
      Ben Weber, Zynga Analytics, @bgweber