FEATURE ENGINEERING FOR DIVERSE DATA TYPES
Alice Zheng
October 10, 2016
Seattle PyLadies Meetup
1
2
MY JOURNEY SO FAR
Shortage of expertise and good tools in the market.
Applied machine learning/data science
Build ML tools
Write a book
3
MACHINE LEARNING IS USEFUL!
Model data.
Make predictions.
Build intelligent applications.
Play chess and go!
4
THE MACHINE LEARNING PIPELINE
Raw data ("It is a puppy and it is extremely cute.") → Features → Models → Predictions → Deploy in production
Models
6
A SIMPLE MODEL
Truth table:
X  Y  X and Y
0  0  0
0  1  0
1  0  0
1  1  1
f(x, y) = 0.5 x + 0.5 y - 1
g(x, y) = 1 if f(x, y) >= 0
          0 if f(x, y) < 0
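The model on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code. The names `f` and `g` follow the slide; the inclusive threshold (`>= 0`) is an assumption made here because f(1, 1) evaluates to exactly 0, so the boundary has to count as positive for g to reproduce the "X and Y" column.

```python
def f(x, y):
    """Linear score from the slide: 0.5*x + 0.5*y - 1."""
    return 0.5 * x + 0.5 * y - 1


def g(x, y):
    """Threshold the score; the boundary f == 0 is treated as 1 (assumption)."""
    return 1 if f(x, y) >= 0 else 0


# Evaluate the model on all four binary inputs.
truth_table = {(x, y): g(x, y) for x in (0, 1) for y in (0, 1)}
for (x, y), out in truth_table.items():
    print(x, y, out)
```

Only the input (1, 1) reaches the threshold, which is what makes this linear model behave like a logical AND.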
7
VISUALIZING A MODEL
[Figure: plot of g(x, y) over the X and Y axes, each running from 0 to 1]
8
FROM SIMPLE TO COMPLEX
Inputs: X1, X2, X3, …, Xn
First layer: r1(X1, X2), r2(X2, X3), …, rm(X1, Xn)
Second layer: s1(r1, r2), s2(r1, r3), …, sm(rm-1, rm)
Use more complicated functions
or
Stack layers of simple functions (e.g., deep neural nets)
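To make "stack layers of simple functions" concrete, here is a small sketch built from threshold units like the one in the simple model. The helper names (`unit`, `xor`) and the weights are illustrative choices, not taken from the talk. A single threshold unit cannot compute XOR, but two stacked layers of them can.

```python
def unit(weights, bias, inputs):
    """A simple threshold unit: 1 if the weighted sum plus bias is positive."""
    total = sum(w * v for w, v in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def xor(x1, x2):
    # Layer 1: two simple functions of the raw inputs.
    r1 = unit([1, 1], -0.5, [x1, x2])    # OR
    r2 = unit([-1, -1], 1.5, [x1, x2])   # NAND
    # Layer 2: a simple function of the layer-1 outputs.
    return unit([1, 1], -1.5, [r1, r2])  # AND


outputs = [xor(a, b) for a in (0, 1) for b in (0, 1)]
print(outputs)
```

Each unit on its own is a linear rule; the extra expressive power comes only from composing them.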
9
BETWEEN RAW DATA AND MODELS
• Mathematical models take numeric input
• Raw data are not numeric (or not the right kind of numeric)
• Featurization: the step in-between
• Feature space: multi-dimensional numeric space where modeling happens
Feature Generation
Feature: An individual measurable property of a phenomenon being observed.
- Christopher Bishop, "Pattern Recognition and Machine Learning"
TEXT
12
TURNING TEXT INTO FEATURES
Raw text: It is a puppy and it is extremely cute.
What are the important measures? Keywords? Verb tense? Subject, object?
Bag-of-words feature vector:
Word       Count
it         2
is         2
puppy      1
and        1
cat        0
aardvark   0
cute       1
extremely  1
…          …
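A bag-of-words vector like the one on this slide can be sketched with the Python standard library. This is a minimal illustration under simple assumptions: the function name `bag_of_words` is hypothetical, and tokenization is just lowercasing plus a basic regular expression, which real pipelines usually refine.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


counts = bag_of_words("It is a puppy and it is extremely cute.")
# Words absent from the text (e.g. "cat") count as 0 in a Counter.
print(counts["it"], counts["puppy"], counts["cat"])
```

Word order is discarded entirely; only the per-word counts survive, which is what makes the representation a "bag".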
13
VISUALIZING BAG-OF-WORDS
[Figure: the sentence "It is a puppy and it is extremely cute" plotted as the point (1, 1) on axes "puppy" and "cute"]
14
CLASSIFYING BAG-OF-WORDS
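[Figure: bag-of-words vectors plotted on axes "puppy", "cat", and "have" for the sentences "I have a puppy", "I have a cat", "I have a kitten", and "I have a dog and I have a pen", with a decision surface drawn through the feature space]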
Feature Cleaning and Transformation
16
AUTO-GENERATED FEATURES ARE NOISY
Rank  Word  Doc Count    Rank  Word  Doc Count
1     the   1,416,058    11    was   929,703
2     and   1,381,324    12    this  844,824
3     a     1,263,126    13    but   822,313
4     i     1,230,214    14    my    786,595
5     to    1,196,238    15    that  777,045
6     it    1,027,835    16    with  775,044
7     of    1,025,638    17    on    735,419
8     for   993,430      18    they  720,994
9     is    988,547      19    you   701,015
10    in    961,518      20    have  692,749
Most popular words in Yelp reviews dataset (~6M reviews).
17
AUTO-GENERATED FEATURES ARE NOISY
Rank     Word         Doc Count    Rank     Word             Doc Count
357,480  cmtk8xyqg    1            357,470  attractif        1
357,479  tangified    1            357,469  chappagetti      1
357,478  laaaaaaasts  1            357,468  herdy            1
357,477  bailouts     1            357,467  csmpus           1
357,476  feautred     1            357,466  costoso          1
357,475  résine       1            357,465  freebased        1
357,474  chilyl       1            357,464  tikme            1
357,473  cariottis    1            357,463  traditionresort  1
357,472  enfeebled    1            357,462  jallisco         1
357,471  sparklely    1            357,461  zoawan           1
Least popular words in Yelp reviews dataset (~6M reviews).
18
FEATURE CLEANING
• Popular words and rare words are not helpful
• Manually defined blacklist – stopwords
a b c d e f g h i
able be came definitely each far get had ie
about became can described edu few gets happens if
above because cannot despite eg fifth getting hardly ignored
according become cant did eight first given has immediately
accordingly becomes cause different either five gives have in
across becoming causes do else followed go having inasmuch
… … … … … … … … …
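Applying a stopword blacklist is a simple set-membership filter. The sketch below is illustrative only: `remove_stopwords` is a hypothetical helper, the first group of words is sampled from the table above, and the second group ("a", "i", "it", "and") is an assumption added so the example sentence filters cleanly.

```python
# Words sampled from the stopword table on the slide.
STOPWORDS = {"able", "about", "above", "be", "because", "can", "cannot",
             "did", "do", "each", "few", "get", "had", "has", "have", "if", "in"}
# Assumed additions, not shown in the slide's excerpt.
STOPWORDS |= {"a", "i", "it", "and"}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword blacklist."""
    return [t for t in tokens if t not in stopwords]


tokens = "i have a puppy and it can be cute".split()
filtered = remove_stopwords(tokens, STOPWORDS)
print(filtered)
```

The list is fixed ahead of time, so no pass over the data is needed, but it also cannot adapt to a new corpus.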
19
FEATURE CLEANING
• Frequency-based pruning
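One way to sketch frequency-based pruning is to count how many documents contain each word and drop words outside a chosen range. The function names, the thresholds (`min_df`, `max_df_ratio`), and the toy reviews below are all assumptions for illustration; suitable cutoffs depend on the data.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def prune_vocabulary(docs, min_df=2, max_df_ratio=0.5):
    """Keep words that are neither too rare nor too common."""
    df = document_frequency(docs)
    n_docs = len(docs)
    return {w for w, c in df.items()
            if c >= min_df and c / n_docs <= max_df_ratio}


docs = [
    "the food was great",
    "the service was slow",
    "great food and great service",
    "the chappagetti was costoso",
]
vocab = prune_vocabulary(docs)
print(sorted(vocab))
```

Here "the" and "was" are dropped as too common and the one-off words as too rare, leaving the mid-frequency vocabulary.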
20
STOPWORDS VS. FREQUENCY FILTERS
Stopwords:
• No training required
• Can be exhaustive
• Inflexible
Frequency filters:
• Adapts to data
• Also deals with rare words
• Needs tuning, hard to control
Both require manual attention
21
FEATURE SCALING WITH TF-IDF
• Scaling "evens out" the features
• A soft filter
• Tf-idf = term frequency x inverse document frequency
• Tf = number of times a term appears in a document
• Idf = log(# total docs / # docs containing word w)
• Large for uncommon words, small for popular words
• Discounts popular words, highlights rare words
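The definitions above translate directly into code. The sketch below is a minimal version that follows the slide's formula (raw counts for tf, unsmoothed log for idf); library implementations often add smoothing or normalization, so their numbers can differ. The function name `tf_idf` is hypothetical, and the four short documents are the ones used in the talk's tf-idf visualization.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf-idf} dict per document, using
    tf = raw count in the document and idf = log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    idf = {w: math.log(n_docs / c) for w, c in df.items()}
    return [{w: tf * idf[w] for w, tf in Counter(tokens).items()}
            for tokens in tokenized]


docs = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]
weights = tf_idf(docs)
print(weights[0]["puppy"], weights[0]["have"])
```

Because "have" appears in every document, its idf is log 1 = 0, so it is zeroed out even where it occurs twice.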
22
VISUALIZING TF-IDF
[Figure: documents plotted on raw-count axes "puppy", "cat", and "have"]
Documents: "I have a puppy", "I have a cat", "I have a kitten", "I have a dog and I have a pen"
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0
23
VISUALIZING TF-IDF
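[Figure: the same documents plotted on tf-idf axes "puppy", "cat", and "have"; "I have a puppy" and "I have a cat" sit at log 4 on their respective axes, while "I have a kitten" and "I have a dog and I have a pen" have value 0 on all three axes]
tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0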
IMAGES
25
REPRESENTING IMAGES
What are the “semantic atoms” of images?
• Semantic atom = a unit of meaning
26
COLOR HISTOGRAM
[Figure: two color histograms, each with bins White and Blue at 40% and 60%]
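A color histogram only records what fraction of pixels has each color. The sketch below is a toy illustration: the "images" are hypothetical flat lists of color names rather than real pixel arrays, and the 40/60 split echoes the slide, although which color gets which share is an assumption.

```python
from collections import Counter


def color_histogram(pixels):
    """Return the fraction of pixels of each color, ignoring pixel positions."""
    counts = Counter(pixels)
    total = len(pixels)
    return {color: n / total for color, n in counts.items()}


# Two hypothetical 10-pixel "images" with different layouts.
image_a = ["white"] * 4 + ["blue"] * 6
image_b = ["blue", "white"] * 4 + ["blue", "blue"]

print(color_histogram(image_a))
print(color_histogram(image_a) == color_histogram(image_b))
```

The two layouts differ but their histograms match, which shows that a color histogram carries no information about where the colors are.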
27
INFORMATION ABOUT STRUCTURE
Collection of local patches encapsulates global structure
28
IMAGE GRADIENTS AND ORIENTATION HISTOGRAM
• Color changes indicate edges, patterns, or texture
• Image gradient: direction of largest change in color, starting from a pixel
[Figure: gradient directions around a pixel, labeled -135°, -90°, -45°, 0°, 45°, 90°, 135°, 180°]
• Gradient orientation histogram: indicates the prominent directions of color change in a patch of pixels
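The idea of a gradient orientation histogram can be sketched in plain Python. This is a simplified illustration, not the SIFT or HOG procedure: it assumes a grayscale patch given as a list of rows, uses central differences for the gradient, skips border pixels, and uses eight bins centered on multiples of 45° (reported on a 0° to 360° scale rather than the -135° to 180° labels in the figure). The function name and the magnitude weighting are choices made here.

```python
import math


def gradient_orientation_histogram(patch, n_bins=8):
    """Histogram of gradient directions over the interior pixels of a
    grayscale patch (list of rows), weighted by gradient magnitude."""
    hist = [0.0] * n_bins
    bin_width = 360.0 / n_bins
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            # Central differences approximate the horizontal and vertical change.
            gx = patch[r][c + 1] - patch[r][c - 1]
            gy = patch[r + 1][c] - patch[r - 1][c]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 360.0
            # Center the bins on 0, 45, 90, ... degrees.
            index = int((angle + bin_width / 2) // bin_width) % n_bins
            hist[index] += magnitude
    return hist


# A 4x4 patch whose brightness increases from left to right.
patch = [[0, 1, 2, 3] for _ in range(4)]
hist = gradient_orientation_histogram(patch)
print(hist)
```

All of the weight lands in the 0° bin because brightness changes only along the horizontal direction in this patch.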
29
SIFT IMAGE FEATURE PIPELINE
Lowe, ICCV 1999
30
DEEP LEARNING APPROACH
• Stack multiple layers – combine local features to form global features
• Similar in spirit to SIFT/HOG
“AlexNet” – Krizhevsky et al., NIPS 2012
31
VISUALIZING ALEXNET
Weights of a trained AlexNet. Left: first layer; right: second layer.
32
FEATURIZATION CHALLENGES
Example text: It is a puppy and it is extremely cute.
                                  "Human native"   Conceptually abstract
Semantic content in data          Low              High
Difficulty of feature generation  Higher           Lower
Data types                        Image, Audio     Text
33
KEY TO FEATURE ENGINEERING
• Features sit in-between data and models
• Need to encapsulate necessary semantic information from raw data
• Distribution of data in feature space should be easily manageable by intended model
• Natural text and logs contain higher level semantic information
• Easier to featurize than images and audio
• Requires ingenuity and intuition!
@RainyData alicez@amazon.com
Amazon Ad Platform is hiring!


Editor's Notes

  1. Features sit between raw data and model. They can make or break an application.