This document covers techniques for preparing data in DataRobot: importing datasets, removing irrelevant features, creating baselines, and exploring datasets through dataset profiles. It then covers data engineering within a project, including exploring values, splitting columns, changing types, and cleaning and transforming data with regular expressions and computed functions. Finally, it covers publishing datasets to the Library and exporting them.
3. DATA PREPARATION USING DATA PREP (PAXATA)
Import the dataset from the local file system or another data source
Rename the imported dataset and add tags to it
Remove features that are not relevant to model development
Baseline the dataset by saving it, which creates a version
Explore the dataset by creating a dataset profile, which is stored as a separate file
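The steps above are performed in the Data Prep user interface. As a rough analogue only, the sketch below imports a small CSV, drops an irrelevant feature, and builds a simple profile using the Python standard library. The sample data, column names, and helper functions are hypothetical and are not part of Data Prep.

```python
import csv
import io

# Hypothetical sample data standing in for an imported dataset.
RAW_CSV = """id,price,colour,notes
1,10.5,red,n/a
2,12.0,blue,n/a
3,,red,n/a
"""


def load_rows(text):
    """Read CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def drop_columns(rows, irrelevant):
    """Return new rows without the columns judged irrelevant to modeling."""
    return [{k: v for k, v in row.items() if k not in irrelevant} for row in rows]


def profile(rows):
    """Build a simple per-column profile: row, missing, and distinct counts."""
    columns = list(rows[0].keys()) if rows else []
    return {
        col: {
            "rows": len(rows),
            "missing": sum(1 for r in rows if r[col] == ""),
            "distinct": len({r[col] for r in rows if r[col] != ""}),
        }
        for col in columns
    }


rows = drop_columns(load_rows(RAW_CSV), irrelevant={"notes"})
dataset_profile = profile(rows)
```

Here `dataset_profile` plays the role of the separately stored profile: it summarizes the dataset without modifying it.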
March 2021
ENTERPRISE AI WITH DATAROBOT 3
13. WHAT NOT TO DO
DataRobot performs featurization tasks automatically
Therefore, you do not need to perform the following yourself:
• Categorical variable encoding
• One-hot encoding
• Missing value imputation
Dataset Profiles do not provide exactly the same statistics as pandas (e.g. unique, value_counts, describe)
• Feature statistics are, however, available on the Project page (e.g. using "FILTER values")
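For reference, the three pandas statistics named above look like this on a small example. This is a minimal sketch that assumes pandas is installed. The two series are hypothetical data, and nothing here reflects how Dataset Profiles are computed.

```python
import pandas as pd

# Hypothetical feature columns, for illustration only.
colour = pd.Series(["red", "blue", "red", "green", "red"], name="colour")
price = pd.Series([10.5, 12.0, 9.5, 12.0], name="price")

# Distinct values, in order of first appearance.
distinct_values = colour.unique()

# Frequency of each value, most frequent first.
counts = colour.value_counts()

# Summary statistics: count, mean, std, min, quartiles, max.
summary = price.describe()
```

The value counts are the closest match to what "FILTER values" shows on the Project page.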
15. DATA ENGINEERING THROUGH ESTABLISHING A PROJECT
Create a project and select the imported dataset
Explore the features using "FILTER values"
Show value counts using "FILTER values"
Split a value into two values using "COLUMN split" and separator string matching
Remove leading space using "WHITESPACE trim leading and trailing"
Remove text from a column that should be numeric only using "COLUMN find + replace"
Change the type of the cleaned column using "CHANGE into ..." and "numeric"
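The cleaning sequence above (split, trim, find and replace, change type) can be sketched in plain Python to show what each step does to a value. The sample values, the ";" separator, and the helper names are assumptions made for illustration. In Data Prep these steps are applied through the tools named above.

```python
def split_column(value, separator):
    """Split one value into two at the first occurrence of the separator."""
    left, _, right = value.partition(separator)
    return left, right


def clean_numeric(text, unwanted):
    """Trim whitespace, remove unwanted text, and convert to a number."""
    trimmed = text.strip()                          # trim leading and trailing whitespace
    digits = trimmed.replace(unwanted, "").strip()  # find + replace with nothing
    return float(digits)                            # change the type to numeric


# Hypothetical raw column holding two pieces of information per value.
raw_values = ["Petrol; 1200 cc", "Diesel; 1500 cc", "Petrol; 998 cc"]

cleaned = []
for raw in raw_values:
    fuel, capacity = split_column(raw, ";")
    cleaned.append((fuel.strip(), clean_numeric(capacity, "cc")))
```

The order matters: the type change only succeeds once the stray text and whitespace are gone, because `float("1200 cc")` would raise a `ValueError`.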
33. DATA ENGINEERING USING REGULAR EXPRESSION
Extract data using "COLUMN split", regular expression, and capture mode
Understand how a regular expression specifies numbers, characters, and repetition
Understand how brackets (capture groups) are used to extract data with a regular expression
Remove redundant feature columns using "COLUMNS"
Understand the impact of values with different units
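The regular expression ideas above can be illustrated with Python's `re` module. This is a sketch under assumptions: the pattern, the weight values, and the conversion table are hypothetical, and Data Prep's own regular expression dialect may differ in detail. The example also shows why mixed units matter, since the extracted numbers are only comparable after conversion to a common unit.

```python
import re

# \d+          one or more digits (number + repetition)
# (?:\.\d+)?   an optional decimal part (non-capturing group)
# \s*          optional whitespace
# [a-zA-Z]+    one or more letters (character class + repetition)
# Each pair of plain parentheses is a capture group that extracts one piece.
PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)")

# Hypothetical conversion table to a common unit (grams).
TO_GRAMS = {"g": 1.0, "kg": 1000.0}


def extract_weight_in_grams(text):
    """Return the weight in grams, or None if the text cannot be parsed."""
    match = PATTERN.search(text)
    if match is None:
        return None
    number, unit = match.groups()
    factor = TO_GRAMS.get(unit.lower())
    if factor is None:
        return None
    return float(number) * factor


weights = [extract_weight_in_grams(v) for v in ["1.5 kg", "750g", "unknown"]]
```

Without the conversion step, "1.5" and "750" would sit in the same numeric column even though they describe very different weights.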
45. DATA ENGINEERING USING COMPUTED FUNCTION
Create a new column using simple computation over other columns
Create a new column using multi-layer conditions over other columns
Cross-examine conditions against computed values using "FILTER values"
Cross-examine conditions against computed values using the list display of "FILTER values"
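The sketch below mirrors these steps in plain Python: one column computed from two others, a second column derived through layered conditions, and a cross-check of the conditions against the computed values. The column names, thresholds, and band labels are hypothetical. Data Prep expresses the same logic through its computed-column functions, whose syntax is not shown here.

```python
from collections import Counter


def price_per_unit(total_price, quantity):
    """Simple computation over two other columns."""
    return total_price / quantity


def price_band(unit_price):
    """Multi-layer condition over the computed column."""
    if unit_price < 10:
        return "low"
    elif unit_price < 50:
        return "medium"
    else:
        return "high"


# Hypothetical rows, for illustration only.
rows = [
    {"total_price": 90.0, "quantity": 10},
    {"total_price": 120.0, "quantity": 4},
    {"total_price": 500.0, "quantity": 5},
    {"total_price": 40.0, "quantity": 8},
]

for row in rows:
    row["unit_price"] = price_per_unit(row["total_price"], row["quantity"])
    row["band"] = price_band(row["unit_price"])

# Cross-check 1: how many rows fell into each band (value counts).
band_counts = Counter(row["band"] for row in rows)

# Cross-check 2: list the computed values behind each band.
values_by_band = {}
for row in rows:
    values_by_band.setdefault(row["band"], []).append(row["unit_price"])
```

Listing the values behind each band makes boundary mistakes easy to spot, for example a value of exactly 10 landing in the wrong band.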
53. PUBLISHING & EXPORTING DATASET
Publish current dataset as an AnswerSet to the Library for shared access
Generate a dataset profile using the newly created AnswerSet
Export the current dataset to the file system
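Publishing to the Library is specific to Data Prep, but the export step has a simple analogue outside the tool. The sketch below writes prepared rows to a CSV file and reads them back to confirm the export is complete. The rows, file name, and `export_csv` helper are hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical prepared rows, for illustration only.
prepared_rows = [{"id": "1", "band": "low"}, {"id": "2", "band": "high"}]


def export_csv(rows, path):
    """Write rows to a CSV file with a header taken from the first row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# Export to a temporary directory so the sketch leaves no files behind
# in the working directory.
export_path = os.path.join(tempfile.mkdtemp(), "prepared_dataset.csv")
export_csv(prepared_rows, export_path)

# Read the file back to verify that the export round-trips.
with open(export_path, newline="", encoding="utf-8") as handle:
    round_trip = list(csv.DictReader(handle))
```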