SlideShare a Scribd company logo
Nick Pentreath, IBM
Feature Hashing for
Scalable Machine
Learning
#EUds15
About
• About me
– @MLnick
– Principal Engineer at IBM working on machine
learning & Apache Spark
– Apache Spark committer & PMC
– Author of Machine Learning with Spark
#EUds15 2
Agenda
• Intro to feature hashing
• HashingTF in Spark ML
• FeatureHasher in Spark ML
• Experiments
• Future Work
#EUds15 3
Feature Hashing
4#EUds15
Encoding Features
5#EUds15
2.1 3.2 −0.2 0.7
!!
• Most ML algorithms
operate on numeric
feature vectors
• Features are often
categorical – even
numerical features (e.g.
binning continuous
features)
Encoding Features
• “one-hot” encoding is
popular for categorical
features
• “bag of words” is
popular for text (or token
counts more generally)
0 0 1 0 0
! 0 2 0 1 1
#EUds15 6
High Dimensional Features
• Many domains have very high dense feature dimension
(e.g. images, video)
• Here we’re concerned with sparse feature domains, e.g.
online ads, ecommerce, social networks, video sharing,
text & NLP
• Model sizes can be very large even for simple models
#EUds15 7
!
The “Hashing Trick”
• Use a hash function to map feature values to
indices in the feature vector
Dublin
Hash “city=dublin” 0 0 1 0 0
Modulo hash value
to vector size to get
index offeature
Stock Price
Hash “stock_price” 0 2.60 … 0 0
Modulo hash value
to vector size to get
index offeature
#EUds15 8
Feature Hashing: Pros
• Fast & Simple
• Preserves sparsity
• Memory efficient
– Limits feature vector size
– No need to store mapping feature name -> index
• Online learning
• Easy handling of missing data
• Feature engineering
#EUds15 9
Feature Hashing: Cons
• No inverse mapping => cannot go from feature
indices back to feature names
– Interpretability & feature importances
– But similar issues with other dim reduction techniques
(e.g. random projections, PCA, SVD)
• Hash collisions …
– Impact on accuracy of feature collisions
– Can use signed hash functions to alleviate part of it
#EUds15 10
HashingTF in Spark ML
#EUds15 11
HashingTF Transformer
• Transforms text (sentences) -> term frequency
vectors (aka “bag of words”)
• Uses the “hashing trick” to compute the feature
indices
• Feature value is term frequency (token count)
• Optional parameter to only return binary token
occurrence vector
#EUds15 12
HashingTF Transformer
#EUds15 13
Hacking HashingTF
• HashingTFcan be used for
categorical features…
• … but doesn’t fit neatly into
Pipelines
#EUds15 14
FeatureHasher in Spark ML
#EUds15 15
FeatureHasher
• Flexible, scalable feature encoding using
hashing trick
• Support multiple input columns (numeric or
string / boolean, i.e. categorical)
• One-shot feature encoder
• Core logic similar to HashingTF
#EUds15 16
FeatureHasher
• Operates on entire Row
• … UDF with struct input
#EUds15 17
FeatureHasher
#EUds15 18
• Determining feature
index
– Numeric: feature
name
– String / boolean:
“feature=value”
• String encoding =>
effectively “one hot”
with indicator value
of 1.0
FeatureHasher
• Modulo raw index to feature vector size
• Hash collisions are additive (currently)
#EUds15 19
FeatureHasher
#EUds15 20
Experiments
#EUds15 21
Text Classification
• Kaggle Email Spam Dataset
0.955
0.96
0.965
0.97
0.975
0.98
0.985
0.99
10 12 14 16 18
Hash bits
AUC by hash bits
HashingTF
CountVectorizer
#EUds15 22
Text Classification
• Adding regularization (regParam=0.01)
0.97
0.975
0.98
0.985
0.99
0.995
10 12 14 16 18
Hash bits
AUC by hash bits
HashingTF
CountVectorizer
#EUds15 23
Ad Click Prediction
• Criteo Display Advertising Challenge
– 45m examples, 34m features, 0.000003% sparsity
• Outbrain Click Prediction
– 80m examples, 15m features, 0.000007% sparsity
• Criteo Terabyte Log Data
– 7 day subset
– 1.5b examples, 300m feature, 0.0000003% sparsity
#EUds15 24
Data
• Illustrative characteristics - Criteo DAC
0
2
4
6
8
10
12
Millions
Unique Values per Feature
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Feature Occurence(%)
#EUds15 25
Challenges
Raw Data StringIndexer OneHotEncoder VectorAssembler
OOM!
• Typical one-hot encoding pipeline fails consistently with
large feature dimension
#EUds15 26
Results
• Compare AUC for different # hash bits
0.695
0.7
0.705
0.71
0.715
0.72
18 20 22 24
Hash bits
Outbrain
AUC
0.74
0.742
0.744
0.746
0.748
0.75
0.752
18 20 22 24
Hash bits
Criteo DAC
AUC
#EUds15 27
Results
• Criteo 1T logs – 7 day subset
• Can train model on 1.5b examples
• 300m original features for this subset
• 224 hashed features (16m)
• Impossible with current Spark ML (OOM, 2Gb
broadcast limit)
#EUds15 28
Summary & Future Work
#EUds15 29
Summary
• Feature hashing is a fast, efficient, flexible tool for
feature encoding
• Can scale to high-dimensional sparse data, without
giving up much accuracy
• Supports multi-column “one-shot” encoding
• Avoids common issues with Spark ML Pipelines
using StringIndexer & OneHotEncoder at scale
#EUds15 30
Future Directions
• Will be part of Spark ML in 2.3.0 release (Q4 2017)
– Refer to SPARK-13969 for details
• Allow users to specify set of numeric columns to be
treated as categorical
• Signed hash functions
• Internal feature crossing & namespaces (ala Vowpal
Wabbit)
• DictVectorizer-like transformer => one-pass feature
encoder for multiple numeric & categorical columns (with
inverse mapping) – see SPARK-19962
#EUds15 31
References
• Hash Kernels
• Feature Hashing for Large Scale Multitask
Learning
• Vowpal Wabbit
• Scikit-learn
#EUds15 32
Thank You.
@MLnick
spark.tc

More Related Content

What's hot

Using Apache Spark in the Cloud—A Devops Perspective with Telmo Oliveira
Using Apache Spark in the Cloud—A Devops Perspective with Telmo OliveiraUsing Apache Spark in the Cloud—A Devops Perspective with Telmo Oliveira
Using Apache Spark in the Cloud—A Devops Perspective with Telmo Oliveira
Spark Summit
 
Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...
Spark Summit
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
Wes McKinney
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Daniel Rodriguez
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Databricks
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
Databricks
 
Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...
Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...
Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...
Databricks
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
Jen Aman
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit
 
Reactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingReactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark Streaming
Spark Summit
 
Scaling Apache Spark on Kubernetes at Lyft
Scaling Apache Spark on Kubernetes at LyftScaling Apache Spark on Kubernetes at Lyft
Scaling Apache Spark on Kubernetes at Lyft
Databricks
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Databricks
 
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryAccelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Databricks
 
Creating Reusable Geospatial Pipelines
Creating Reusable Geospatial PipelinesCreating Reusable Geospatial Pipelines
Creating Reusable Geospatial Pipelines
Databricks
 
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaSpeeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Databricks
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
Databricks
 
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Spark Summit
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Spark Summit
 

What's hot (20)

Using Apache Spark in the Cloud—A Devops Perspective with Telmo Oliveira
Using Apache Spark in the Cloud—A Devops Perspective with Telmo OliveiraUsing Apache Spark in the Cloud—A Devops Perspective with Telmo Oliveira
Using Apache Spark in the Cloud—A Devops Perspective with Telmo Oliveira
 
Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...Building a Business Logic Translation Engine with Spark Streaming for Communi...
Building a Business Logic Translation Engine with Spark Streaming for Communi...
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark Ecosystem
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
 
Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...
Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...
Accelerating Real Time Analytics with Spark Streaming and FPGAaaS with Prabha...
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
 
Reactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingReactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark Streaming
 
Scaling Apache Spark on Kubernetes at Lyft
Scaling Apache Spark on Kubernetes at LyftScaling Apache Spark on Kubernetes at Lyft
Scaling Apache Spark on Kubernetes at Lyft
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
 
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryAccelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
 
Creating Reusable Geospatial Pipelines
Creating Reusable Geospatial PipelinesCreating Reusable Geospatial Pipelines
Creating Reusable Geospatial Pipelines
 
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaSpeeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
 
Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
 
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East tal...
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
 

Similar to Feature Hashing for Scalable Machine Learning with Nick Pentreath

Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Spark Summit
 
Spark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick Pentreath
Spark Summit
 
Math with .NET for you and Azure
Math with .NET for you and AzureMath with .NET for you and Azure
Math with .NET for you and Azure
Marco Parenzan
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)
 
Web development basics (Part-7)
Web development basics (Part-7)Web development basics (Part-7)
Web development basics (Part-7)
Rajat Pratap Singh
 
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft..."Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
Dataconomy Media
 
Old code doesn't stink
Old code doesn't stinkOld code doesn't stink
Old code doesn't stink
Martin Gutenbrunner
 
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Lucidworks
 
Deep Learning for Search: Personalization and Deep Tokenization
Deep Learning for Search: Personalization and Deep TokenizationDeep Learning for Search: Personalization and Deep Tokenization
Deep Learning for Search: Personalization and Deep Tokenization
Jake Mannix
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the World
Milo Yip
 
SPARKNaCl: A verified, fast cryptographic library
SPARKNaCl: A verified, fast cryptographic librarySPARKNaCl: A verified, fast cryptographic library
SPARKNaCl: A verified, fast cryptographic library
AdaCore
 
Scaling Machine Learning Systems up to Billions of Predictions per Day
Scaling Machine Learning Systems up to Billions of Predictions per DayScaling Machine Learning Systems up to Billions of Predictions per Day
Scaling Machine Learning Systems up to Billions of Predictions per Day
Carmine Paolino
 
(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby
Amazon Web Services
 
Machine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxMachine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackbox
Ivo Andreev
 
Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)
Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)
Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)
Ontico
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?
Ivo Andreev
 
Toward Hybrid Cloud Serverless Transparency with Lithops Framework
Toward Hybrid Cloud Serverless Transparency with Lithops FrameworkToward Hybrid Cloud Serverless Transparency with Lithops Framework
Toward Hybrid Cloud Serverless Transparency with Lithops Framework
LibbySchulze
 
Feature engineering pipelines
Feature engineering pipelinesFeature engineering pipelines
Feature engineering pipelines
Ramesh Sampath
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Databricks
 
[JSS2015] In memory and operational analytics
[JSS2015] In memory and operational analytics[JSS2015] In memory and operational analytics
[JSS2015] In memory and operational analytics
GUSS
 

Similar to Feature Hashing for Scalable Machine Learning with Nick Pentreath (20)

Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
 
Spark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick Pentreath
 
Math with .NET for you and Azure
Math with .NET for you and AzureMath with .NET for you and Azure
Math with .NET for you and Azure
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Web development basics (Part-7)
Web development basics (Part-7)Web development basics (Part-7)
Web development basics (Part-7)
 
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft..."Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
 
Old code doesn't stink
Old code doesn't stinkOld code doesn't stink
Old code doesn't stink
 
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
 
Deep Learning for Search: Personalization and Deep Tokenization
Deep Learning for Search: Personalization and Deep TokenizationDeep Learning for Search: Personalization and Deep Tokenization
Deep Learning for Search: Personalization and Deep Tokenization
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the World
 
SPARKNaCl: A verified, fast cryptographic library
SPARKNaCl: A verified, fast cryptographic librarySPARKNaCl: A verified, fast cryptographic library
SPARKNaCl: A verified, fast cryptographic library
 
Scaling Machine Learning Systems up to Billions of Predictions per Day
Scaling Machine Learning Systems up to Billions of Predictions per DayScaling Machine Learning Systems up to Billions of Predictions per Day
Scaling Machine Learning Systems up to Billions of Predictions per Day
 
(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby
 
Machine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxMachine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackbox
 
Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)
Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)
Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?
 
Toward Hybrid Cloud Serverless Transparency with Lithops Framework
Toward Hybrid Cloud Serverless Transparency with Lithops FrameworkToward Hybrid Cloud Serverless Transparency with Lithops Framework
Toward Hybrid Cloud Serverless Transparency with Lithops Framework
 
Feature engineering pipelines
Feature engineering pipelinesFeature engineering pipelines
Feature engineering pipelines
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
 
[JSS2015] In memory and operational analytics
[JSS2015] In memory and operational analytics[JSS2015] In memory and operational analytics
[JSS2015] In memory and operational analytics
 

More from Spark Summit

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Spark Summit
 
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Spark Summit
 
Variant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr SzulVariant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr Szul
Spark Summit
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Spark Summit
 
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Spark Summit
 

More from Spark Summit (20)

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
 
Variant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr SzulVariant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr Szul
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
 
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene Pang
 

Recently uploaded

原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 

Recently uploaded (20)

原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 

Feature Hashing for Scalable Machine Learning with Nick Pentreath

  • 1. Nick Pentreath, IBM Feature Hashing for Scalable Machine Learning #EUds15
  • 2. About • About me – @MLnick – Principal Engineer at IBM working on machine learning & Apache Spark – Apache Spark committer & PMC – Author of Machine Learning with Spark #EUds15 2
  • 3. Agenda • Intro to feature hashing • HashingTF in Spark ML • FeatureHasher in Spark ML • Experiments • Future Work #EUds15 3
  • 5. Encoding Features 5#EUds15 2.1 3.2 −0.2 0.7 !! • Most ML algorithms operate on numeric feature vectors • Features are often categorical – even numerical features (e.g. binning continuous features)
  • 6. Encoding Features • “one-hot” encoding is popular for categorical features • “bag of words” is popular for text (or token counts more generally) 0 0 1 0 0 ! 0 2 0 1 1 #EUds15 6
  • 7. High Dimensional Features • Many domains have very high dense feature dimension (e.g. images, video) • Here we’re concerned with sparse feature domains, e.g. online ads, ecommerce, social networks, video sharing, text & NLP • Model sizes can be very large even for simple models #EUds15 7
  • 8. ! The “Hashing Trick” • Use a hash function to map feature values to indices in the feature vector Dublin Hash “city=dublin” 0 0 1 0 0 Modulo hash value to vector size to get index offeature Stock Price Hash “stock_price” 0 2.60 … 0 0 Modulo hash value to vector size to get index offeature #EUds15 8
  • 9. Feature Hashing: Pros • Fast & Simple • Preserves sparsity • Memory efficient – Limits feature vector size – No need to store mapping feature name -> index • Online learning • Easy handling of missing data • Feature engineering #EUds15 9
  • 10. Feature Hashing: Cons • No inverse mapping => cannot go from feature indices back to feature names – Interpretability & feature importances – But similar issues with other dim reduction techniques (e.g. random projections, PCA, SVD) • Hash collisions … – Impact on accuracy of feature collisions – Can use signed hash functions to alleviate part of it #EUds15 10
  • 11. HashingTF in Spark ML #EUds15 11
  • 12. HashingTF Transformer • Transforms text (sentences) -> term frequency vectors (aka “bag of words”) • Uses the “hashing trick” to compute the feature indices • Feature value is term frequency (token count) • Optional parameter to only return binary token occurrence vector #EUds15 12
  • 14. Hacking HashingTF • HashingTFcan be used for categorical features… • … but doesn’t fit neatly into Pipelines #EUds15 14
  • 15. FeatureHasher in Spark ML #EUds15 15
  • 16. FeatureHasher • Flexible, scalable feature encoding using hashing trick • Support multiple input columns (numeric or string / boolean, i.e. categorical) • One-shot feature encoder • Core logic similar to HashingTF #EUds15 16
  • 17. FeatureHasher • Operates on entire Row • … UDF with struct input #EUds15 17
  • 18. FeatureHasher #EUds15 18 • Determining feature index – Numeric: feature name – String / boolean: “feature=value” • String encoding => effectively “one hot” with indicator value of 1.0
  • 19. FeatureHasher • Modulo raw index to feature vector size • Hash collisions are additive (currently) #EUds15 19
  • 22. Text Classification • Kaggle Email Spam Dataset 0.955 0.96 0.965 0.97 0.975 0.98 0.985 0.99 10 12 14 16 18 Hash bits AUC by hash bits HashingTF CountVectorizer #EUds15 22
  • 23. Text Classification • Adding regularization (regParam=0.01) 0.97 0.975 0.98 0.985 0.99 0.995 10 12 14 16 18 Hash bits AUC by hash bits HashingTF CountVectorizer #EUds15 23
  • 24. Ad Click Prediction • Criteo Display Advertising Challenge – 45m examples, 34m features, 0.000003% sparsity • Outbrain Click Prediction – 80m examples, 15m features, 0.000007% sparsity • Criteo Terabyte Log Data – 7 day subset – 1.5b examples, 300m feature, 0.0000003% sparsity #EUds15 24
  • 25. Data • Illustrative characteristics - Criteo DAC 0 2 4 6 8 10 12 Millions Unique Values per Feature 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Feature Occurence(%) #EUds15 25
  • 26. Challenges Raw Data StringIndexer OneHotEncoder VectorAssembler OOM! • Typical one-hot encoding pipeline fails consistently with large feature dimension #EUds15 26
  • 27. Results • Compare AUC for different # hash bits 0.695 0.7 0.705 0.71 0.715 0.72 18 20 22 24 Hash bits Outbrain AUC 0.74 0.742 0.744 0.746 0.748 0.75 0.752 18 20 22 24 Hash bits Criteo DAC AUC #EUds15 27
  • 28. Results • Criteo 1T logs – 7 day subset • Can train model on 1.5b examples • 300m original features for this subset • 224 hashed features (16m) • Impossible with current Spark ML (OOM, 2Gb broadcast limit) #EUds15 28
  • 29. Summary & Future Work #EUds15 29
  • 30. Summary • Feature hashing is a fast, efficient, flexible tool for feature encoding • Can scale to high-dimensional sparse data, without giving up much accuracy • Supports multi-column “one-shot” encoding • Avoids common issues with Spark ML Pipelines using StringIndexer & OneHotEncoder at scale #EUds15 30
  • 31. Future Directions • Will be part of Spark ML in 2.3.0 release (Q4 2017) – Refer to SPARK-13969 for details • Allow users to specify set of numeric columns to be treated as categorical • Signed hash functions • Internal feature crossing & namespaces (ala Vowpal Wabbit) • DictVectorizer-like transformer => one-pass feature encoder for multiple numeric & categorical columns (with inverse mapping) – see SPARK-19962 #EUds15 31
  • 32. References • Hash Kernels • Feature Hashing for Large Scale Multitask Learning • Vowpal Wabbit • Scikit-learn #EUds15 32