- 1. Feature Engineering for Machine Learning Amanda Casari Principal Product Manager + Data Scientist Concur Labs @ SAP Concur @amcasari
- 2. here to there via random walk product + data @ SAP Concur control systems engineering + robotics + legos officer in US Navy operations research analyst wandering dirtbag + conservation volunteer EE + applied math + complex systems underwater robotics consultant extraordinaire stay at home mom co-author NASA Datanaut @amcasari
- 3. data science is not magic… @amcasari
- 4. …but it is a process (sometimes painful) @amcasari @MROGATI
- 5. it is easy to get turned around…. @amcasari idea research exploration hypotheses model outcomes feedback
- 6. …and it is easy to get mixed up xkcd #1838 @amcasari
- 7. …so let’s focus on getting from data to models feature engineering goes here! @amcasari
- 8. when we say… DATA SCIENCE: the interdisciplinary intersection of methods, processes, algorithms and problem solving techniques to extract knowledge from data [1]. MACHINE LEARNING [ML]: fitting mathematical models to data in order to derive insights or make predictions [2]. FEATURE: a numeric representation of an aspect of raw data [2]. FEATURE ENGINEERING: the act of extracting features from raw data and transforming them into formats that are suitable for the machine learning model [2]. hint: our community is well represented in Wikipedia @amcasari
- 9. [n.b. ethics] DATA: an abstract representation of reality, not reality itself. Data is a part of the system of record, but not the actual system itself. SOCIAL CONSTRUCT: “jointly constructed understandings of the world that form the basis for shared assumptions about reality” [1]. BIAS: results from unfair sampling of a population, or from an estimation process that does not give accurate results on average [2]. ACCOUNTABILITY: you are answerable for your decisions and obligated to be able to explain the resulting consequences [3]. hint: much more about this w/ @kjam at 14:30 @amcasari
- 10. how to choose? 1/ FRAME YOUR PROBLEM: Can you frame your problem in a way that machine learning could be useful? e.g. prediction. 2/ UNDERSTAND YOUR DATA: What data will be most helpful for understanding this problem? 3/ FRAME YOUR FEATURE GOALS: What are you optimizing for? Iteration speed; model performance. 4/ TEST, ITERATE, TEST AGAIN: Check your choices for robustness. Validate, but realize this will still change. @amcasari
- 11. vector space. scalar: single numeric feature; vector: ordered list of scalars. Example: 1/ two-dimensional vector, v = [1, -1] @amcasari
- 12. feature space. In data, abstract vectors take on actual meaning. Examples: 1/ a vector can represent a person’s preference for songs (song = feature; +1: thumbs-up; -1: thumbs-down). 2/ a vector can represent a song via the individual preferences of people in a group @amcasari
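The two views of feature space on this slide can be sketched in a few lines of Python. This is only an illustration: the song names, the people, and the ratings are made up, and `person_vector` / `song_vector` are hypothetical helper names, not anything from the talk or the book.

```python
# Hypothetical songs and ratings, invented purely for illustration.
songs = ["song_a", "song_b", "song_c"]

# +1 = thumbs-up, -1 = thumbs-down.
ratings = {
    "alice": {"song_a": 1, "song_b": -1, "song_c": 1},
    "bob": {"song_a": -1, "song_b": -1, "song_c": 1},
}


def person_vector(person):
    """View 1: a person is a vector whose features are songs."""
    return [ratings[person][song] for song in songs]


def song_vector(song):
    """View 2: a song is a vector whose features are people's preferences."""
    return [ratings[person][song] for person in sorted(ratings)]


alice = person_vector("alice")    # one entry per song
song_a = song_vector("song_a")    # one entry per person (alphabetical order)
```

The same table of ratings yields both kinds of vectors; which one you use depends on whether the thing you want to model is the listener or the song.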
- 13. Counts: Fancy Tricks with Simple Numbers
- 16. counts: fixed width binning @amcasari
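A minimal sketch of fixed-width binning, assuming non-negative integer counts and a bin width chosen by hand. The function name `fixed_width_bin`, the width of 10, and the sample counts are illustrative assumptions.

```python
def fixed_width_bin(count, width):
    """Return the index of the fixed-width bin that `count` falls into.

    Bin 0 covers [0, width), bin 1 covers [width, 2 * width), and so on.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    return count // width


# Made-up raw counts, e.g. number of plays per song.
counts = [3, 17, 42, 99, 250]
bins = [fixed_width_bin(c, 10) for c in counts]
```

With a width of 10, every count from 0 to 9 lands in bin 0, 10 to 19 in bin 1, and so on. If the counts span several orders of magnitude, many fixed-width bins end up empty, which is one motivation for log-based approaches.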
- 18. counts: log transform binning. $\log_a(a^x) = x$, where $a$ is a positive constant and $x$ can be any positive number; $a^0 = 1$, so $\log_a(1) = 0$. tl;dr the log function compresses the range of large numbers and expands the range of small numbers @amcasari
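To make the "compresses large, expands small" point concrete, here is a small sketch using base-10 logs from Python's standard library. The sample counts are made up, and the use of `math.log1p` for counts that may be zero is a common workaround added here as an assumption, not something stated on the slide.

```python
import math

# Made-up counts spanning several orders of magnitude.
counts = [1, 10, 100, 1000, 100000]

# Base-10 log transform; only defined for strictly positive values.
log_counts = [math.log10(c) for c in counts]

# Raw gap vs. log gap between the two largest values.
raw_gap_large = counts[4] - counts[3]
log_gap_large = log_counts[4] - log_counts[3]

# Raw gap vs. log gap between the two smallest values.
raw_gap_small = counts[1] - counts[0]
log_gap_small = log_counts[1] - log_counts[0]

# Assumption: when zero counts are possible, log(1 + x) keeps the
# transform defined at zero.
safe_log_counts = [math.log1p(c) for c in [0] + counts]
```

The raw gap between the two largest counts is 11,000 times the gap between the two smallest, but after the transform it is only about twice as large.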
- 19. What does scaling do for features? @amcasari
- 23. proper scaling preserves underlying shape @amcasari
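As a rough illustration of "preserves underlying shape", the sketch below implements two common linear scalings, min-max scaling and standardization, in plain Python. Choosing these two methods is an assumption on my part (the slide does not name the scalers it plots), and the function names and sample values are hypothetical.

```python
import statistics


def min_max_scale(values):
    """Linearly rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot min-max scale a constant feature")
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and rescale to unit (population) variance."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mean) / std for v in values]


# Made-up feature values.
values = [2.0, 4.0, 6.0, 10.0]
scaled = min_max_scale(values)
z_scores = standardize(values)
```

Both transforms only shift and stretch the axis: the ordering of the points and the ratios between their gaps stay the same, so the shape of the distribution is unchanged and only its location and scale differ.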
- 24. Text: Flatten, Filter, Chunk
- 27. filter: frequency-based filtering (stopwords). These NLP libraries have both English + Portuguese corpora, models, etc.: 1/ spaCy 2/ NLTK 3/ OpenNLP @amcasari
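A minimal sketch of stopword filtering on top of a bag-of-words count, using only the standard library. The tiny `STOPWORDS` set and the regex tokenizer are stand-in assumptions for illustration; the libraries listed on the slide ship much fuller stopword lists and proper tokenizers.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword set, for illustration only.
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text, stopwords=STOPWORDS):
    """Flatten text into word counts, dropping stopwords."""
    return Counter(t for t in tokenize(text) if t not in stopwords)


doc = "The cat sat in the hat and the cat is happy"
bow = bag_of_words(doc)
```

Of the 11 tokens in the sample sentence, 6 are stopwords, so only 5 survive the filter. Words like "the" would otherwise dominate the counts without saying much about the content.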
- 28. chunk: parts of speech matter @amcasari Pop Chart Lab, npr.org
- 29. @amcasari thank you @RainyData code repo | buy the book here!