Feature Engineering for Machine Learning
Amanda Casari
Principal Product Manager + Data Scientist
Concur Labs @ SAP Concur
@amcasari
here to there via random walk
product + data @ SAP Concur
control systems engineering + robotics + legos
officer in US Navy
operations research analyst
wandering dirtbag + conservation volunteer
EE + applied math + complex systems
underwater robotics
consultant
extraordinaire
stay at home mom
co-author
NASA Datanaut
data science is not magic…

…but it is a process (sometimes painful)
@MROGATI

it is easy to get turned around…
idea
research
exploration
hypotheses
model
outcomes
feedback

…and it is easy to get mixed up
xkcd #1838

…so let’s focus on getting from data to models
feature engineering goes here!
when we say…
DATA SCIENCE
• … the interdisciplinary intersection of methods, processes, algorithms and problem solving techniques to extract knowledge from data [1]
MACHINE LEARNING [ML]
• … fitting mathematical models to data in order to derive insights or make predictions [2]
FEATURE
• … a numeric representation of an aspect of raw data [2]
FEATURE ENGINEERING
• … the act of extracting features from raw data and transforming them into formats that are suitable for the machine learning model [2]
hint: our community is well represented in Wikipedia
[n.b. ethics]
DATA
• … is an abstract representation of reality, not reality itself. Data is a part of the system of record, but not the actual system itself.
SOCIAL CONSTRUCT
• … “jointly constructed understandings of the world that form the basis for shared assumptions about reality” [1]
BIAS
• … results from unfair sampling of a population, or from an estimation process that does not give accurate results on average [2]
ACCOUNTABILITY
• … you are answerable for your decisions and obligated to be able to explain the resulting consequences [3]
hint: much more about this w/ @kjam at 14:30
how to choose?
1/ FRAME YOUR PROBLEM
• Can you frame your problem in a way that machine learning could be useful? e.g. prediction
2/ UNDERSTAND YOUR DATA
• What data will be most helpful for building a better understanding of this problem?
3/ FRAME YOUR FEATURE GOALS
• What are you optimizing for?
  • Iteration speed
  • Model performance
4/ TEST, ITERATE, TEST AGAIN
• Check your choices for robustness
• Validate but realize this will still change
vector space
scalar: single numeric feature
vector: ordered list of scalars
Example:
1/ two-dimensional vector, v = [1, -1]
feature space
In data, abstract vectors take on actual meaning
Examples:
• 1/ a vector can represent a person’s preference for songs
  • Song = feature
  • +1: Thumbs-up
  • -1: Thumbs-down
• 2/ a song can represent individual preferences in a group
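A minimal sketch of the feature-space idea in plain Python. The song names, listener names, and helper functions are made up for illustration; only the +1 / -1 encoding comes from the slide. Each listener becomes a vector over songs, and each song becomes a vector over listeners.

```python
# Hypothetical songs and ratings, invented for illustration.
songs = ["song_a", "song_b", "song_c"]

# +1 = thumbs-up, -1 = thumbs-down (the encoding used on the slide).
ratings = {
    "alice": {"song_a": 1, "song_b": -1, "song_c": 1},
    "bob": {"song_a": -1, "song_b": -1, "song_c": 1},
}


def person_vector(person):
    """A person's preferences: one scalar feature per song."""
    return [ratings[person][song] for song in songs]


def song_vector(song):
    """A song seen as the individual preferences of everyone in the group."""
    return [ratings[person][song] for person in sorted(ratings)]


alice = person_vector("alice")    # [1, -1, 1]
song_a = song_vector("song_a")    # [1, -1]
```

The same table of ratings gives both views: read it by row for a person, by column for a song.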
Counts: Fancy Tricks with Simple Numbers
counts: binarization
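One way to sketch binarization, assuming the common reading that any count above a threshold becomes 1 and everything else becomes 0. The function name, the threshold parameter, and the listen counts are hypothetical.

```python
def binarize(counts, threshold=0):
    """Map each count to 1 if it exceeds the threshold, else 0."""
    return [1 if c > threshold else 0 for c in counts]


# Hypothetical listen counts for one user across five songs.
listen_counts = [0, 3, 1, 0, 250]
binary_features = binarize(listen_counts)  # [0, 1, 1, 0, 1]
```

After binarization, a song played 250 times and a song played once look the same, which may be closer to "the user likes this song" than the raw count is.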
counts: binning
counts: fixed width binning
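A rough sketch of fixed-width binning, under two assumed variants: bins of equal linear width, and bins that each span one power of ten. Both helper names and the sample values are hypothetical.

```python
def fixed_width_bin(value, width):
    """Bin index for a non-negative integer using equal-width bins."""
    return value // width


def power_of_ten_bin(count):
    """Bin index by order of magnitude: 0-9 -> 0, 10-99 -> 1, 100-999 -> 2, ..."""
    bin_index = 0
    while count >= 10:
        count //= 10
        bin_index += 1
    return bin_index


# Hypothetical ages, binned by decade.
ages = [3, 17, 25, 42, 68]
age_bins = [fixed_width_bin(a, 10) for a in ages]        # [0, 1, 2, 4, 6]

# Hypothetical counts spanning several orders of magnitude.
counts = [7, 59, 296, 11506]
magnitude_bins = [power_of_ten_bin(c) for c in counts]   # [0, 1, 2, 4]
```

The bin edges are chosen in advance and do not depend on the data, so some bins can end up empty when the counts are heavily skewed.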
counts: adaptive binning
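A sketch of adaptive binning, assuming it means quantile-based bins whose edges come from the data itself (the slide does not name the method). It uses `statistics.quantiles` from the standard library (Python 3.8+); the helper name and the sample counts are hypothetical.

```python
import bisect
import statistics


def quantile_bins(values, n_bins=4):
    """Assign each value to one of n_bins bins using data-driven cut points."""
    cut_points = statistics.quantiles(values, n=n_bins)
    bins = [bisect.bisect_right(cut_points, v) for v in values]
    return cut_points, bins


# Hypothetical, heavily skewed counts.
counts = [1, 2, 3, 5, 10, 50, 400, 9000]
cut_points, bins = quantile_bins(counts)
# bins -> [0, 0, 1, 1, 2, 2, 3, 3]
```

Each quartile holds two of the eight values here, whereas four equal-width bins over the same range would put seven of them in the first bin.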
counts: log transform binning
$\log_a(a^x) = x$, where $a$ is a positive constant and $x$ can be any positive number
$a^0 = 1$, so $\log_a(1) = 0$
tl;dr: the log function compresses the range of large numbers and expands the range of small numbers
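A small sketch of the log transform on counts. Adding 1 before taking the log is an assumption made here so that a count of zero stays defined; it is not stated on the slide. The function name and sample counts are hypothetical.

```python
import math


def log_transform(counts):
    """Apply log10(count + 1); the +1 keeps zero counts defined."""
    return [math.log10(c + 1) for c in counts]


# Hypothetical counts spanning five orders of magnitude.
counts = [0, 9, 99, 999, 99999]
transformed = log_transform(counts)
# approximately [0.0, 1.0, 2.0, 3.0, 5.0]
```

The gap between 999 and 99999 shrinks from 99000 to about 2, while the gap between 0 and 9 is about 1, so small counts take up a relatively larger share of the transformed range.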
What does scaling do for features?
normalization: feature scaling
proper scaling preserves underlying shape
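A sketch of two common scaling choices, min-max scaling and standardization. The slides do not say which scalers were shown, so both are assumptions, as are the function names and sample values. Both functions assume the feature is not constant (otherwise they divide by zero).

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical feature values with one large outlier.
values = [10, 20, 30, 40, 100]
scaled = min_max_scale(values)
standardized = standardize(values)
```

Both transforms are linear, so the ordering of the points and the relative spacing between them are unchanged; only the units on the axis move.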
Text: Flatten, Filter, Chunk
why text?
hedonometer.org
flatten: bag-of-words (BoW)
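A minimal bag-of-words sketch using only the standard library. Splitting on whitespace is a simplification assumed here in place of a real tokenizer, and the function name and sample sentence are hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Flatten a document into word counts; word order is discarded."""
    tokens = text.lower().split()
    return Counter(tokens)


doc = "it is a puppy and it is extremely cute"
bow = bag_of_words(doc)
# bow["it"] == 2, bow["is"] == 2, bow["puppy"] == 1
```

Two documents with the same words in a different order produce the same bag, which is what "flatten" means here.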
filter: frequency based filtering (stopwords)
These NLP libraries have both English + Portuguese corpora, models, etc.:
1/ spaCy
2/ NLTK
3/ OpenNLP
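A sketch of the two filters in plain Python. The stopword list below is a tiny hand-written stand-in for illustration only; in practice the list would come from a library such as the ones named above. The function names and thresholds are hypothetical.

```python
from collections import Counter

# Tiny hand-written stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "of"}


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def filter_by_frequency(tokens, min_count=1, max_count=None):
    """Keep tokens whose count in this token list falls within the given range."""
    counts = Counter(tokens)
    return [
        t for t in tokens
        if counts[t] >= min_count and (max_count is None or counts[t] <= max_count)
    ]


tokens = "the puppy is cute and the puppy is small".split()
content_words = remove_stopwords(tokens)
# ['puppy', 'cute', 'puppy', 'small']
frequent_words = filter_by_frequency(content_words, min_count=2)
# ['puppy', 'puppy']
```

Setting `max_count` removes very common words without needing a list, and `min_count` removes rare words that are often noise.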
chunk: parts of speech matter
Pop Chart Lab, npr.org
thank you
@RainyData
code repo
buy the book here!

Feature Engineering for Machine Learning at QConSP
