SlideShare a Scribd company logo
1 of 33
Download to read offline
Nick Pentreath, IBM
Feature Hashing for
Scalable Machine
Learning
#EUds15
About
• About me
– @MLnick
– Principal Engineer at IBM working on machine
learning & Apache Spark
– Apache Spark committer & PMC
– Author of Machine Learning with Spark
#EUds15 2
Agenda
• Intro to feature hashing
• HashingTF in Spark ML
• FeatureHasher in Spark ML
• Experiments
• Future Work
#EUds15 3
Feature Hashing
4#EUds15
Encoding Features
5#EUds15
2.1 3.2 −0.2 0.7
!!
• Most ML algorithms
operate on numeric
feature vectors
• Features are often
categorical – even
numerical features (e.g.
binning continuous
features)
Encoding Features
• “one-hot” encoding is
popular for categorical
features
• “bag of words” is
popular for text (or token
counts more generally)
0 0 1 0 0
! 0 2 0 1 1
#EUds15 6
High Dimensional Features
• Many domains have very high dense feature dimension
(e.g. images, video)
• Here we’re concerned with sparse feature domains, e.g.
online ads, ecommerce, social networks, video sharing,
text & NLP
• Model sizes can be very large even for simple models
#EUds15 7
!
The “Hashing Trick”
• Use a hash function to map feature values to
indices in the feature vector
Dublin
Hash “city=dublin” 0 0 1 0 0
Modulo hash value
to vector size to get
index offeature
Stock Price
Hash “stock_price” 0 2.60 … 0 0
Modulo hash value
to vector size to get
index offeature
#EUds15 8
Feature Hashing: Pros
• Fast & Simple
• Preserves sparsity
• Memory efficient
– Limits feature vector size
– No need to store mapping feature name -> index
• Online learning
• Easy handling of missing data
• Feature engineering
#EUds15 9
Feature Hashing: Cons
• No inverse mapping => cannot go from feature
indices back to feature names
– Interpretability & feature importances
– But similar issues with other dim reduction techniques
(e.g. random projections, PCA, SVD)
• Hash collisions …
– Impact on accuracy of feature collisions
– Can use signed hash functions to alleviate part of it
#EUds15 10
HashingTF in Spark ML
#EUds15 11
HashingTF Transformer
• Transforms text (sentences) -> term frequency
vectors (aka “bag of words”)
• Uses the “hashing trick” to compute the feature
indices
• Feature value is term frequency (token count)
• Optional parameter to only return binary token
occurrence vector
#EUds15 12
HashingTF Transformer
#EUds15 13
Hacking HashingTF
• HashingTFcan be used for
categorical features…
• … but doesn’t fit neatly into
Pipelines
#EUds15 14
FeatureHasher in Spark ML
#EUds15 15
FeatureHasher
• Flexible, scalable feature encoding using
hashing trick
• Support multiple input columns (numeric or
string / boolean, i.e. categorical)
• One-shot feature encoder
• Core logic similar to HashingTF
#EUds15 16
FeatureHasher
• Operates on entire Row
• … UDF with struct input
#EUds15 17
FeatureHasher
#EUds15 18
• Determining feature
index
– Numeric: feature
name
– String / boolean:
“feature=value”
• String encoding =>
effectively “one hot”
with indicator value
of 1.0
FeatureHasher
• Modulo raw index to feature vector size
• Hash collisions are additive (currently)
#EUds15 19
FeatureHasher
#EUds15 20
Experiments
#EUds15 21
Text Classification
• Kaggle Email Spam Dataset
0.955
0.96
0.965
0.97
0.975
0.98
0.985
0.99
10 12 14 16 18
Hash bits
AUC by hash bits
HashingTF
CountVectorizer
#EUds15 22
Text Classification
• Adding regularization (regParam=0.01)
0.97
0.975
0.98
0.985
0.99
0.995
10 12 14 16 18
Hash bits
AUC by hash bits
HashingTF
CountVectorizer
#EUds15 23
Ad Click Prediction
• Criteo Display Advertising Challenge
– 45m examples, 34m features, 0.000003% sparsity
• Outbrain Click Prediction
– 80m examples, 15m features, 0.000007% sparsity
• Criteo Terabyte Log Data
– 7 day subset
– 1.5b examples, 300m feature, 0.0000003% sparsity
#EUds15 24
Data
• Illustrative characteristics - Criteo DAC
0
2
4
6
8
10
12
Millions
Unique Values per Feature
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Feature Occurence(%)
#EUds15 25
Challenges
Raw Data StringIndexer OneHotEncoder VectorAssembler
OOM!
• Typical one-hot encoding pipeline fails consistently with
large feature dimension
#EUds15 26
Results
• Compare AUC for different # hash bits
0.695
0.7
0.705
0.71
0.715
0.72
18 20 22 24
Hash bits
Outbrain
AUC
0.74
0.742
0.744
0.746
0.748
0.75
0.752
18 20 22 24
Hash bits
Criteo DAC
AUC
#EUds15 27
Results
• Criteo 1T logs – 7 day subset
• Can train model on 1.5b examples
• 300m original features for this subset
• 224 hashed features (16m)
• Impossible with current Spark ML (OOM, 2Gb
broadcast limit)
#EUds15 28
Summary & Future Work
#EUds15 29
Summary
• Feature hashing is a fast, efficient, flexible tool for
feature encoding
• Can scale to high-dimensional sparse data, without
giving up much accuracy
• Supports multi-column “one-shot” encoding
• Avoids common issues with Spark ML Pipelines
using StringIndexer & OneHotEncoder at scale
#EUds15 30
Future Directions
• Will be part of Spark ML in 2.3.0 release (Q4 2017)
– Refer to SPARK-13969 for details
• Allow users to specify set of numeric columns to be
treated as categorical
• Signed hash functions
• Internal feature crossing & namespaces (ala Vowpal
Wabbit)
• DictVectorizer-like transformer => one-pass feature
encoder for multiple numeric & categorical columns (with
inverse mapping) – see SPARK-19962
#EUds15 31
References
• Hash Kernels
• Feature Hashing for Large Scale Multitask
Learning
• Vowpal Wabbit
• Scikit-learn
#EUds15 32
Thank You.
@MLnick
spark.tc

More Related Content

What's hot

Introduction to System verilog
Introduction to System verilog Introduction to System verilog
Introduction to System verilog Pushpa Yakkala
 
Presentation on Scaling
Presentation on ScalingPresentation on Scaling
Presentation on ScalingRaviraj Kaur
 
Level sensitive scan design(LSSD) and Boundry scan(BS)
Level sensitive scan design(LSSD) and Boundry scan(BS)Level sensitive scan design(LSSD) and Boundry scan(BS)
Level sensitive scan design(LSSD) and Boundry scan(BS)Praveen Kumar
 
Rs232 485 fundamental
Rs232 485 fundamentalRs232 485 fundamental
Rs232 485 fundamentalrounak077
 
Hardware Description Language
Hardware Description Language Hardware Description Language
Hardware Description Language Prachi Pandey
 
Data Science - Part VII - Cluster Analysis
Data Science - Part VII -  Cluster AnalysisData Science - Part VII -  Cluster Analysis
Data Science - Part VII - Cluster AnalysisDerek Kane
 
Two stage op amp design on cadence
Two stage op amp design on cadenceTwo stage op amp design on cadence
Two stage op amp design on cadenceHaowei Jiang
 
CMOS Ring Oscillator Using Stacking Techniques to Reduce Power Dissipation an...
CMOS Ring Oscillator Using Stacking Techniques to Reduce Power Dissipation an...CMOS Ring Oscillator Using Stacking Techniques to Reduce Power Dissipation an...
CMOS Ring Oscillator Using Stacking Techniques to Reduce Power Dissipation an...Vikas Sah
 
Dcn a03-analog transmission
Dcn a03-analog transmissionDcn a03-analog transmission
Dcn a03-analog transmissionSaurabh Daga
 
VHDL-PRESENTATION.ppt
VHDL-PRESENTATION.pptVHDL-PRESENTATION.ppt
VHDL-PRESENTATION.pptDr.YNM
 
1 introduction to vlsi physical design
1 introduction to vlsi physical design1 introduction to vlsi physical design
1 introduction to vlsi physical designsasikun
 
Online Tweet Sentiment Analysis with Apache Spark
Online Tweet Sentiment Analysis with Apache SparkOnline Tweet Sentiment Analysis with Apache Spark
Online Tweet Sentiment Analysis with Apache SparkDavide Nardone
 
Types of Microprocessor 8085 and 8086
Types of Microprocessor 8085 and 8086Types of Microprocessor 8085 and 8086
Types of Microprocessor 8085 and 8086Rubaiet Raihan
 
Application of VLSI In Artificial Intelligence
Application of VLSI In Artificial IntelligenceApplication of VLSI In Artificial Intelligence
Application of VLSI In Artificial IntelligenceIOSR Journals
 
minimisation of crosstalk in VLSI routing
minimisation of crosstalk in VLSI routingminimisation of crosstalk in VLSI routing
minimisation of crosstalk in VLSI routingChandrajit Pal
 

What's hot (20)

Introduction to System verilog
Introduction to System verilog Introduction to System verilog
Introduction to System verilog
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Presentation on Scaling
Presentation on ScalingPresentation on Scaling
Presentation on Scaling
 
Level sensitive scan design(LSSD) and Boundry scan(BS)
Level sensitive scan design(LSSD) and Boundry scan(BS)Level sensitive scan design(LSSD) and Boundry scan(BS)
Level sensitive scan design(LSSD) and Boundry scan(BS)
 
Oscillatorsppt
OscillatorspptOscillatorsppt
Oscillatorsppt
 
Rs232 485 fundamental
Rs232 485 fundamentalRs232 485 fundamental
Rs232 485 fundamental
 
Need of Decoupling Capacitor
Need of Decoupling CapacitorNeed of Decoupling Capacitor
Need of Decoupling Capacitor
 
Kernighan lin
Kernighan linKernighan lin
Kernighan lin
 
Hardware Description Language
Hardware Description Language Hardware Description Language
Hardware Description Language
 
Data Science - Part VII - Cluster Analysis
Data Science - Part VII -  Cluster AnalysisData Science - Part VII -  Cluster Analysis
Data Science - Part VII - Cluster Analysis
 
Two stage op amp design on cadence
Two stage op amp design on cadenceTwo stage op amp design on cadence
Two stage op amp design on cadence
 
CMOS Ring Oscillator Using Stacking Techniques to Reduce Power Dissipation an...
CMOS Ring Oscillator Using Stacking Techniques to Reduce Power Dissipation an...CMOS Ring Oscillator Using Stacking Techniques to Reduce Power Dissipation an...
CMOS Ring Oscillator Using Stacking Techniques to Reduce Power Dissipation an...
 
Dcn a03-analog transmission
Dcn a03-analog transmissionDcn a03-analog transmission
Dcn a03-analog transmission
 
VHDL-PRESENTATION.ppt
VHDL-PRESENTATION.pptVHDL-PRESENTATION.ppt
VHDL-PRESENTATION.ppt
 
1 introduction to vlsi physical design
1 introduction to vlsi physical design1 introduction to vlsi physical design
1 introduction to vlsi physical design
 
Folded cascode1
Folded cascode1Folded cascode1
Folded cascode1
 
Online Tweet Sentiment Analysis with Apache Spark
Online Tweet Sentiment Analysis with Apache SparkOnline Tweet Sentiment Analysis with Apache Spark
Online Tweet Sentiment Analysis with Apache Spark
 
Types of Microprocessor 8085 and 8086
Types of Microprocessor 8085 and 8086Types of Microprocessor 8085 and 8086
Types of Microprocessor 8085 and 8086
 
Application of VLSI In Artificial Intelligence
Application of VLSI In Artificial IntelligenceApplication of VLSI In Artificial Intelligence
Application of VLSI In Artificial Intelligence
 
minimisation of crosstalk in VLSI routing
minimisation of crosstalk in VLSI routingminimisation of crosstalk in VLSI routing
minimisation of crosstalk in VLSI routing
 

Similar to Feature Hashing for Scalable Machine Learning with Nick Pentreath

Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Spark Summit
 
Spark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit
 
Math with .NET for you and Azure
Math with .NET for you and AzureMath with .NET for you and Azure
Math with .NET for you and AzureMarco Parenzan
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Web development basics (Part-7)
Web development basics (Part-7)Web development basics (Part-7)
Web development basics (Part-7)Rajat Pratap Singh
 
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft..."Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...Dataconomy Media
 
Deep Learning for Search: Personalization and Deep Tokenization
Deep Learning for Search: Personalization and Deep TokenizationDeep Learning for Search: Personalization and Deep Tokenization
Deep Learning for Search: Personalization and Deep TokenizationJake Mannix
 
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...Lucidworks
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldMilo Yip
 
SPARKNaCl: A verified, fast cryptographic library
SPARKNaCl: A verified, fast cryptographic librarySPARKNaCl: A verified, fast cryptographic library
SPARKNaCl: A verified, fast cryptographic libraryAdaCore
 
Scaling Machine Learning Systems up to Billions of Predictions per Day
Scaling Machine Learning Systems up to Billions of Predictions per DayScaling Machine Learning Systems up to Billions of Predictions per Day
Scaling Machine Learning Systems up to Billions of Predictions per DayCarmine Paolino
 
(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in RubyAmazon Web Services
 
Machine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxMachine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxIvo Andreev
 
Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)
Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)
Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)Ontico
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?Ivo Andreev
 
Toward Hybrid Cloud Serverless Transparency with Lithops Framework
Toward Hybrid Cloud Serverless Transparency with Lithops FrameworkToward Hybrid Cloud Serverless Transparency with Lithops Framework
Toward Hybrid Cloud Serverless Transparency with Lithops FrameworkLibbySchulze
 
Feature engineering pipelines
Feature engineering pipelinesFeature engineering pipelines
Feature engineering pipelinesRamesh Sampath
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDatabricks
 
Jss 2015 in memory and operational analytics
Jss 2015   in memory and operational analyticsJss 2015   in memory and operational analytics
Jss 2015 in memory and operational analyticsDavid Barbarin
 

Similar to Feature Hashing for Scalable Machine Learning with Nick Pentreath (20)

Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
 
Spark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick Pentreath
 
Math with .NET for you and Azure
Math with .NET for you and AzureMath with .NET for you and Azure
Math with .NET for you and Azure
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Web development basics (Part-7)
Web development basics (Part-7)Web development basics (Part-7)
Web development basics (Part-7)
 
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft..."Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
 
Old code doesn't stink
Old code doesn't stinkOld code doesn't stink
Old code doesn't stink
 
Deep Learning for Search: Personalization and Deep Tokenization
Deep Learning for Search: Personalization and Deep TokenizationDeep Learning for Search: Personalization and Deep Tokenization
Deep Learning for Search: Personalization and Deep Tokenization
 
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
Deep Learning for Unified Personalized Search and Recommendations - Jake Mann...
 
How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the World
 
SPARKNaCl: A verified, fast cryptographic library
SPARKNaCl: A verified, fast cryptographic librarySPARKNaCl: A verified, fast cryptographic library
SPARKNaCl: A verified, fast cryptographic library
 
Scaling Machine Learning Systems up to Billions of Predictions per Day
Scaling Machine Learning Systems up to Billions of Predictions per DayScaling Machine Learning Systems up to Billions of Predictions per Day
Scaling Machine Learning Systems up to Billions of Predictions per Day
 
(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby(DEV309) Large-Scale Metrics Analysis in Ruby
(DEV309) Large-Scale Metrics Analysis in Ruby
 
Machine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxMachine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackbox
 
Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)
Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)
Memcached-инъекции - они существуют и работают, Иван Новиков (ONsec)
 
The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?
 
Toward Hybrid Cloud Serverless Transparency with Lithops Framework
Toward Hybrid Cloud Serverless Transparency with Lithops FrameworkToward Hybrid Cloud Serverless Transparency with Lithops Framework
Toward Hybrid Cloud Serverless Transparency with Lithops Framework
 
Feature engineering pipelines
Feature engineering pipelinesFeature engineering pipelines
Feature engineering pipelines
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
 
Jss 2015 in memory and operational analytics
Jss 2015   in memory and operational analyticsJss 2015   in memory and operational analytics
Jss 2015 in memory and operational analytics
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookmanojkuma9823
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 

Recently uploaded (20)

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 

Feature Hashing for Scalable Machine Learning with Nick Pentreath

  • 1. Nick Pentreath, IBM Feature Hashing for Scalable Machine Learning #EUds15
  • 2. About • About me – @MLnick – Principal Engineer at IBM working on machine learning & Apache Spark – Apache Spark committer & PMC – Author of Machine Learning with Spark #EUds15 2
  • 3. Agenda • Intro to feature hashing • HashingTF in Spark ML • FeatureHasher in Spark ML • Experiments • Future Work #EUds15 3
  • 5. Encoding Features 5#EUds15 2.1 3.2 −0.2 0.7 !! • Most ML algorithms operate on numeric feature vectors • Features are often categorical – even numerical features (e.g. binning continuous features)
  • 6. Encoding Features • “one-hot” encoding is popular for categorical features • “bag of words” is popular for text (or token counts more generally) 0 0 1 0 0 ! 0 2 0 1 1 #EUds15 6
  • 7. High Dimensional Features • Many domains have very high dense feature dimension (e.g. images, video) • Here we’re concerned with sparse feature domains, e.g. online ads, ecommerce, social networks, video sharing, text & NLP • Model sizes can be very large even for simple models #EUds15 7
  • 8. ! The “Hashing Trick” • Use a hash function to map feature values to indices in the feature vector Dublin Hash “city=dublin” 0 0 1 0 0 Modulo hash value to vector size to get index offeature Stock Price Hash “stock_price” 0 2.60 … 0 0 Modulo hash value to vector size to get index offeature #EUds15 8
  • 9. Feature Hashing: Pros • Fast & Simple • Preserves sparsity • Memory efficient – Limits feature vector size – No need to store mapping feature name -> index • Online learning • Easy handling of missing data • Feature engineering #EUds15 9
  • 10. Feature Hashing: Cons • No inverse mapping => cannot go from feature indices back to feature names – Interpretability & feature importances – But similar issues with other dim reduction techniques (e.g. random projections, PCA, SVD) • Hash collisions … – Impact on accuracy of feature collisions – Can use signed hash functions to alleviate part of it #EUds15 10
  • 11. HashingTF in Spark ML #EUds15 11
  • 12. HashingTF Transformer • Transforms text (sentences) -> term frequency vectors (aka “bag of words”) • Uses the “hashing trick” to compute the feature indices • Feature value is term frequency (token count) • Optional parameter to only return binary token occurrence vector #EUds15 12
  • 14. Hacking HashingTF • HashingTFcan be used for categorical features… • … but doesn’t fit neatly into Pipelines #EUds15 14
  • 15. FeatureHasher in Spark ML #EUds15 15
  • 16. FeatureHasher • Flexible, scalable feature encoding using hashing trick • Support multiple input columns (numeric or string / boolean, i.e. categorical) • One-shot feature encoder • Core logic similar to HashingTF #EUds15 16
  • 17. FeatureHasher • Operates on entire Row • … UDF with struct input #EUds15 17
  • 18. FeatureHasher #EUds15 18 • Determining feature index – Numeric: feature name – String / boolean: “feature=value” • String encoding => effectively “one hot” with indicator value of 1.0
  • 19. FeatureHasher • Modulo raw index to feature vector size • Hash collisions are additive (currently) #EUds15 19
  • 22. Text Classification • Kaggle Email Spam Dataset 0.955 0.96 0.965 0.97 0.975 0.98 0.985 0.99 10 12 14 16 18 Hash bits AUC by hash bits HashingTF CountVectorizer #EUds15 22
  • 23. Text Classification • Adding regularization (regParam=0.01) 0.97 0.975 0.98 0.985 0.99 0.995 10 12 14 16 18 Hash bits AUC by hash bits HashingTF CountVectorizer #EUds15 23
  • 24. Ad Click Prediction • Criteo Display Advertising Challenge – 45m examples, 34m features, 0.000003% sparsity • Outbrain Click Prediction – 80m examples, 15m features, 0.000007% sparsity • Criteo Terabyte Log Data – 7 day subset – 1.5b examples, 300m feature, 0.0000003% sparsity #EUds15 24
  • 25. Data • Illustrative characteristics - Criteo DAC 0 2 4 6 8 10 12 Millions Unique Values per Feature 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Feature Occurence(%) #EUds15 25
  • 26. Challenges Raw Data StringIndexer OneHotEncoder VectorAssembler OOM! • Typical one-hot encoding pipeline fails consistently with large feature dimension #EUds15 26
  • 27. Results • Compare AUC for different # hash bits 0.695 0.7 0.705 0.71 0.715 0.72 18 20 22 24 Hash bits Outbrain AUC 0.74 0.742 0.744 0.746 0.748 0.75 0.752 18 20 22 24 Hash bits Criteo DAC AUC #EUds15 27
  • 28. Results • Criteo 1T logs – 7 day subset • Can train model on 1.5b examples • 300m original features for this subset • 224 hashed features (16m) • Impossible with current Spark ML (OOM, 2Gb broadcast limit) #EUds15 28
  • 29. Summary & Future Work #EUds15 29
  • 30. Summary • Feature hashing is a fast, efficient, flexible tool for feature encoding • Can scale to high-dimensional sparse data, without giving up much accuracy • Supports multi-column “one-shot” encoding • Avoids common issues with Spark ML Pipelines using StringIndexer & OneHotEncoder at scale #EUds15 30
  • 31. Future Directions • Will be part of Spark ML in 2.3.0 release (Q4 2017) – Refer to SPARK-13969 for details • Allow users to specify set of numeric columns to be treated as categorical • Signed hash functions • Internal feature crossing & namespaces (ala Vowpal Wabbit) • DictVectorizer-like transformer => one-pass feature encoder for multiple numeric & categorical columns (with inverse mapping) – see SPARK-19962 #EUds15 31
  • 32. References • Hash Kernels • Feature Hashing for Large Scale Multitask Learning • Vowpal Wabbit • Scikit-learn #EUds15 32