How should data be preprocessed for use in machine learning algorithms? How to identify the most predictive attributes of a dataset? What features can generate to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential of the features of a dataset, increasing flexibility, simplicity and accuracy of the models. The analysis of the distribution of features and their correlations, the transformation of numeric attributes (such as scaling, normalization, log-based transformation, binning), categorical attributes (such as one-hot encoding, feature hashing, Temporal (date / time), and free-text attributes (text vectorization, topic modeling).
Python, Python, Scikit-learn, and Spark SQL examples will be presented and how to use domain knowledge and intuition to select and generate features relevant to predictive models.