SlideShare a Scribd company logo
BigML Education
Flatline
July 2017
BigML Education Program 2Flatline
In This Video
• Introduction
• Define Flatline: Feature Engineering & Filtering
• Define Feature Engineering: why is it important?
• Examples: Automatic Feature Engineering / Filtering
• Basic Features
• Explore built-in functions for FE / Filtering
• Refresh options
• Advanced Features
• Formally introduce Flatline s-expressions
• Lisp-style expressions for FE and filtering.
• Examples of custom transformations
BigML Education Program 3Flatline
Flatline
Introduction
BigML Education Program 4Flatline
What is Flatline?
• DSL:
• Invented by BigML - Programmatic / Optimized for speed
• Transforms datasets into new datasets
• Adding new fields / Filtering
• Transformations are written in lisp-style syntax
• Programmatic Filtering:
• Filtering datasets according to functions that evaluate to
true/false using the row of data as an input.
• Feature Engineering???
Flatline: a domain specific language for feature
engineering and programmatic filtering
BigML Education Program 5Flatline
What is Feature Engineering?
• This is really, really important - more than algorithm selection!
• ML Algorithms have no deeper understanding of data
• Numerical: have a natural order, can be scaled, etc
• Categorical: have discrete values, etc.
• The "magic" is the ability to find patterns quickly and efficiently
• ML Algorithms only know what you tell/show it with data
• Medical: Kg and M, but BMI = Kg/M2 is better
• Lending: Debt and Income, but DTI is better
• Intuition can be risky: remember to prove it with an evaluation!
Feature Engineering: applying domain knowledge of
the data to create new features that allow ML
algorithms to work better, or to work at all.
BigML Education Program 6Flatline
Automatic Transformations
2013-09-25 10:02
DATE-TIME
Date-Time Fields
… year month day hour minute …
… 2013 Sep 25 10 2 …
… … … … … … …
NUM NUMCAT NUM NUM
• Date-Time fields have a lot of information "packed" into them
• Splitting out the time components allows ML algorithms to
discover time-based patterns.
BigML Education Program 7Flatline
Automatic Transformations
Categorical Fields for Clustering/LR
… alchemy_category …
… business …
… recreation …
… health …
… … …
CAT
business health recreation …
… 1 0 0 …
… 0 0 1 …
… 0 1 0 …
… … … … …
NUM NUM NUM
• Clustering and Logistic Regression require numeric fields for
inputs
• Categorical values are transformed to numeric vectors
automatically*
*Note: In BigML, clustering uses k-prototypes and the encoding used for LR can be configured.
BigML Education Program 8Flatline
Automatic Transformations
Be not afraid of greatness:
some are born great, some achieve
greatness, and some have greatness
thrust upon ‘em.
TEXT
Text Fields
… great afraid born achieve …
… 4 1 1 1 …
… … … … … …
NUM NUM NUM NUM
• Unstructured text contains a lot of potentially interesting patterns
• Bag-of-words analysis happens automatically and extracts the
"interesting" tokens in the text
• Another option is Topic Modeling to extract thematic meaning
BigML Education Program 9Flatline
Help ML to Work Better
{
“url":"cbsnews",
"title":"Breaking News Headlines
Business Entertainment World News “,
"body":" news covering all the latest
breaking national and world news
headlines, including politics, sports,
entertainment, business and more.”
}
TEXT
title body
Breaking News… news covering…
… …
TEXT TEXT
When text is not actually unstructured
• In this case, the text field has structure (key/value pairs)
• Extracting the structure as new features may allow the ML
algorithm to work better
BigML Education Program 10Flatline
Help ML to Work at all
When the pattern does not exist
Highway Number Direction Is Long
2 East-West FALSE
4 East-West FALSE
5 North-South TRUE
8 East-West FALSE
10 East-West TRUE
… … …
Goal: Predict principle direction from highway number
( = (mod (field "Highway Number") 2) 0)
BigML Education Program 11Flatline
Flatline
Basic Features
BigML Education Program 12Flatline
Built-ins for Filtering
• Built-in filtering:
• Common filtering options for each field type
• Internally the built-in filtering writes a Flatline expression
• Numeric Fields: Comparison / Equals / Missing / Statistics
• Date-Time: Comparison / Equals / Missing
• Categorical Fields: Specific values / Missing values
• Text Fields: Equals / Contains / Doesn’t contain / Missing
• Refresh Fields:
• Types: recomputes field types. Ex: #classes > 1000
• Preferred: recomputes preferred status
BigML Education Program 13Flatline
Built-ins for Filtering
• boilerplate contains "cup"
• alchemy_category is not "sports" nor "unknown"
• alchemy_category_score >= 0.70
StumbleUpon Filtering Example
BigML Education Program 14Flatline
Built-ins for FE
• Discretize: Converts a numeric value to categorical
• Replace missing values: fixed/max/mean/median/etc
• Normalize: Adjust a numeric value to a specific range of values
while preserving the distribution
• Math: Exponentiation, Logarithms, Squares, Roots, etc
• Types: Force a field value to categorical, integer, or real
• Random: Create random values for introducing noise
• Statistics: Mean, Population
• Refresh Fields:
• Types: recomputes field types. Ex: #classes > 1000
• Preferred: recomputes preferred status
BigML Education Program 15Flatline
Built-in Functions
Discretization
Total Spend
7.342,99
304,12
4,56
345,87
8.546,32
NUM
“Predict will spend
$3,521 with error
$1,232”
Spend Category
Top 33%
Bottom 33%
Bottom 33%
Middle 33%
Top 33%
CAT
“Predict customer
will be Top 33% in
spending”
BigML Education Program 16Flatline
Flatline
Advanced Features
BigML Education Program 17Flatline
Flatline
• Lisp style syntax: Operators come first
• Correct: (+ 1 2) : NOT Correct: (1 + 2)
• Dataset Fields are first-class citizens
• (field “diabetes pedigree”)
• Limited programming language structures
• let, cond, if, maps, list operators, */+-, etc.
• Built-in transformations
• statistics, strings, timestamps, windows
BigML Education Program 18Flatline
Flatline Add Fields
Computing with Existing Features
Debt Income
10.134 100.000
85.234 134.000
8.112 21.500
0 45.900
17.534 52.000
NUM NUM
(/ (field "Debt") (field "Income"))
Debt
Income
Debt to Income Ratio
0,10
0,64
0,38
0
0,34
NUM
BigML Education Program 19Flatline
Flatline Filtering
• Any flatline expression that evaluates to true/false
can be used to filter a dataset
• If the expression is true for the row, it will be
kept in the output dataset
• If the expression is false for the row, it will be
left out from the output dataset
• Filtering can be direct or sometimes in two steps:
1. Add field to dataset
2. Filter new field with a built-in function
BigML Education Program 20Flatline
Flatline Filtering
Filtering by Computation
Debt Income
10.134 100.000
85.234 134.000
8.112 21.500
0 45.900
17.534 52.000
NUM NUM
(< (/ (field "Debt") (field "Income")) 0.35)
Debt
Income
< 0.35
Debt Income
10.134 100.000
85.234 134.000
8.112 21.500
0 45.900
17.534 52.000
NUM NUM
BigML Education Program 21Flatline
Flatline s-expressions
(= 0 (+ (abs ( f "Month - 3" ) ) (abs ( f "Month - 2")) (abs ( f "Month - 1") ) ))
Name Month - 3 Month - 2 Month - 1
Joe Schmo 123,23 0 0
Jane Plain 0 0 0
Mary Happy 0 55,22 243,33
Tom Thumb 12,34 8,34 14,56
Un-Labelled Data
Labelled data
Name Month - 3 Month - 2 Month - 1 Default
Joe Schmo 123,23 0 0 FALSE
Jane Plain 0 0 0 TRUE
Mary Happy 0 55,22 243,33 FALSE
Tom Thumb 12,34 8,34 14,56 FALSE
Adding Simple Labels to Data
Define "default" as
missing three
payments in a row
BigML Education Program 22Flatline
Flatline s-expressions
date volume price
1 34353 314
2 44455 315
3 22333 315
4 52322 321
5 28000 320
6 31254 319
7 56544 323
8 44331 324
9 81111 287
10 65422 294
11 59999 300
12 45556 302
13 19899 301
Current - (4-day avg)
std dev
Shock: Deviations from a Trend
day-4 day-3 day-2 day-1 4davg
-
314 -
314 315 -
314 315 315 -
314 315 315 321 316,25
315 315 321 320 317,75
315 321 320 319 318,75
BigML Education Program 23Flatline
Flatline s-expressions
Current - (4-day avg)
std dev
Shock: Deviations from a Trend
Current : (field “price”)
4-day avg: (avg-window “price” -4 -1)
std dev: (standard-deviation “price”)
(/ (- ( f "price") (avg-window "price" -4, -1)) (standard-deviation "price"))
BigML Education Program 24Flatline
Advanced s-expressions
Moon Phase%
( / ( mod ( - ( / ( epoch ( field "date-field" )) 1000 ) 621300 ) 2551443 ) 2551442 )
Highway isEven?
( = (mod (field "Highway Number") 2) 0)
( let (R 6371000 latA (to-radians {lat-ref}) latB (to-radians ( field "LATITUDE" ) )
latD ( - latB latA ) longD ( to-radians ( - ( field "LONGITUDE" ) {long-ref} ) ) a (
+ ( square ( sin ( / latD 2 ) ) ) ( * (cos latA) (cos latB) (square ( sin ( / longD
2))) ) ) c ( * 2 ( asin ( min (list 1 (sqrt a)) ) ) ) ) ( * R c ) )
Distance Lat/Long <=> Ref (Haversine)
BigML Education Program 25Flatline
WhizzML + Flatline
HAVERSINE
FLATLINE
OUTPUT
DATASET
INPUT
DATASET
LONG Ref
LAT Ref
WHIZZML SCRIPT
GALLERY
BigML Education Program 26Flatline
Summary
• Flatline: programmatic feature engineering / filtering
• Feature Engineering: what is it / why it is important
• Automatic transformations: date-time, text, etc
• Built-in functions: filtering and feature engineering
• Discretization / Normalization / etc.
• Flatline expressions
• Structure
• Examples: Adding fields / filtering
• WhizzML
• Wrapping Flatline
• Quick start with scriptify

More Related Content

What's hot

What's hot (10)

Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
 
Changelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkChangelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache Flink
 
Apache Spark Data Validation
Apache Spark Data ValidationApache Spark Data Validation
Apache Spark Data Validation
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + Kafka
 
VOWLMap: Graph-based Ontology Alignment Visualization and Editing
VOWLMap: Graph-based Ontology Alignment Visualization and EditingVOWLMap: Graph-based Ontology Alignment Visualization and Editing
VOWLMap: Graph-based Ontology Alignment Visualization and Editing
 
Building Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field ExperienceBuilding Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field Experience
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Evotec - How can Knowledge Graphs support Druh Discovery
Evotec - How can Knowledge Graphs support Druh DiscoveryEvotec - How can Knowledge Graphs support Druh Discovery
Evotec - How can Knowledge Graphs support Druh Discovery
 

Similar to BigML Education - Feature Engineering with Flatline

Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
Databricks
 

Similar to BigML Education - Feature Engineering with Flatline (20)

BSSML17 - Feature Engineering
BSSML17 - Feature EngineeringBSSML17 - Feature Engineering
BSSML17 - Feature Engineering
 
DutchMLSchool. Automating Decision Making
DutchMLSchool. Automating Decision MakingDutchMLSchool. Automating Decision Making
DutchMLSchool. Automating Decision Making
 
VSSML18. Feature Engineering
VSSML18. Feature EngineeringVSSML18. Feature Engineering
VSSML18. Feature Engineering
 
MLSD18. Feature Engineering
MLSD18. Feature EngineeringMLSD18. Feature Engineering
MLSD18. Feature Engineering
 
MLSEV. Automating Decision Making
MLSEV. Automating Decision MakingMLSEV. Automating Decision Making
MLSEV. Automating Decision Making
 
BSSML16 L7. Feature Engineering
BSSML16 L7. Feature EngineeringBSSML16 L7. Feature Engineering
BSSML16 L7. Feature Engineering
 
VSSML18. Introduction to Machine Learning and the BigML Platform
VSSML18. Introduction to Machine Learning and the BigML PlatformVSSML18. Introduction to Machine Learning and the BigML Platform
VSSML18. Introduction to Machine Learning and the BigML Platform
 
Practical data science
Practical data sciencePractical data science
Practical data science
 
BSSML17 - Basic Data Transformations
BSSML17 - Basic Data TransformationsBSSML17 - Basic Data Transformations
BSSML17 - Basic Data Transformations
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
 
AlphaPy
AlphaPyAlphaPy
AlphaPy
 
AlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in PythonAlphaPy: A Data Science Pipeline in Python
AlphaPy: A Data Science Pipeline in Python
 
VSSML18. Data Transformations
VSSML18. Data TransformationsVSSML18. Data Transformations
VSSML18. Data Transformations
 
VSSML16 L5. Basic Data Transformations
VSSML16 L5. Basic Data TransformationsVSSML16 L5. Basic Data Transformations
VSSML16 L5. Basic Data Transformations
 
Citizen Data Science Training using KNIME
Citizen Data Science Training using KNIMECitizen Data Science Training using KNIME
Citizen Data Science Training using KNIME
 
Web UI, Algorithms, and Feature Engineering
Web UI, Algorithms, and Feature Engineering Web UI, Algorithms, and Feature Engineering
Web UI, Algorithms, and Feature Engineering
 
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIMEPredicting Azure Churn with Deep Learning and Explaining Predictions with LIME
Predicting Azure Churn with Deep Learning and Explaining Predictions with LIME
 
When Should I Use Simulation?
When Should I Use Simulation?When Should I Use Simulation?
When Should I Use Simulation?
 
VSSML16 LR2. Summary Day 2
VSSML16 LR2. Summary Day 2VSSML16 LR2. Summary Day 2
VSSML16 LR2. Summary Day 2
 
MLIntro_ADA.pptx
MLIntro_ADA.pptxMLIntro_ADA.pptx
MLIntro_ADA.pptx
 

More from BigML, Inc

More from BigML, Inc (20)

Digital Transformation and Process Optimization in Manufacturing
Digital Transformation and Process Optimization in ManufacturingDigital Transformation and Process Optimization in Manufacturing
Digital Transformation and Process Optimization in Manufacturing
 
DutchMLSchool 2022 - Automation
DutchMLSchool 2022 - AutomationDutchMLSchool 2022 - Automation
DutchMLSchool 2022 - Automation
 
DutchMLSchool 2022 - ML for AML Compliance
DutchMLSchool 2022 - ML for AML ComplianceDutchMLSchool 2022 - ML for AML Compliance
DutchMLSchool 2022 - ML for AML Compliance
 
DutchMLSchool 2022 - Multi Perspective Anomalies
DutchMLSchool 2022 - Multi Perspective AnomaliesDutchMLSchool 2022 - Multi Perspective Anomalies
DutchMLSchool 2022 - Multi Perspective Anomalies
 
DutchMLSchool 2022 - My First Anomaly Detector
DutchMLSchool 2022 - My First Anomaly Detector DutchMLSchool 2022 - My First Anomaly Detector
DutchMLSchool 2022 - My First Anomaly Detector
 
DutchMLSchool 2022 - Anomaly Detection
DutchMLSchool 2022 - Anomaly DetectionDutchMLSchool 2022 - Anomaly Detection
DutchMLSchool 2022 - Anomaly Detection
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in ML
 
DutchMLSchool 2022 - End-to-End ML
DutchMLSchool 2022 - End-to-End MLDutchMLSchool 2022 - End-to-End ML
DutchMLSchool 2022 - End-to-End ML
 
DutchMLSchool 2022 - A Data-Driven Company
DutchMLSchool 2022 - A Data-Driven CompanyDutchMLSchool 2022 - A Data-Driven Company
DutchMLSchool 2022 - A Data-Driven Company
 
DutchMLSchool 2022 - ML in the Legal Sector
DutchMLSchool 2022 - ML in the Legal SectorDutchMLSchool 2022 - ML in the Legal Sector
DutchMLSchool 2022 - ML in the Legal Sector
 
DutchMLSchool 2022 - Smart Safe Stadiums
DutchMLSchool 2022 - Smart Safe StadiumsDutchMLSchool 2022 - Smart Safe Stadiums
DutchMLSchool 2022 - Smart Safe Stadiums
 
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants
DutchMLSchool 2022 - Process Optimization in Manufacturing PlantsDutchMLSchool 2022 - Process Optimization in Manufacturing Plants
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants
 
DutchMLSchool 2022 - Anomaly Detection at Scale
DutchMLSchool 2022 - Anomaly Detection at ScaleDutchMLSchool 2022 - Anomaly Detection at Scale
DutchMLSchool 2022 - Anomaly Detection at Scale
 
DutchMLSchool 2022 - Citizen Development in AI
DutchMLSchool 2022 - Citizen Development in AIDutchMLSchool 2022 - Citizen Development in AI
DutchMLSchool 2022 - Citizen Development in AI
 
Democratizing Object Detection
Democratizing Object DetectionDemocratizing Object Detection
Democratizing Object Detection
 
BigML Release: Image Processing
BigML Release: Image ProcessingBigML Release: Image Processing
BigML Release: Image Processing
 
Machine Learning in Retail: Know Your Customers' Customer. See Your Future
Machine Learning in Retail: Know Your Customers' Customer. See Your FutureMachine Learning in Retail: Know Your Customers' Customer. See Your Future
Machine Learning in Retail: Know Your Customers' Customer. See Your Future
 
Machine Learning in Retail: ML in the Retail Sector
Machine Learning in Retail: ML in the Retail SectorMachine Learning in Retail: ML in the Retail Sector
Machine Learning in Retail: ML in the Retail Sector
 
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
ML in GRC: Machine Learning in Legal Automation, How to Trust a LawyerbotML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot
 
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...
 

Recently uploaded

一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
MAQIB18
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
DilipVasan
 

Recently uploaded (20)

一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
Computer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage sComputer Presentation.pptx ecommerce advantage s
Computer Presentation.pptx ecommerce advantage s
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
Using PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDBUsing PDB Relocation to Move a Single PDB to Another Existing CDB
Using PDB Relocation to Move a Single PDB to Another Existing CDB
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
 
How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?How can I successfully sell my pi coins in Philippines?
How can I successfully sell my pi coins in Philippines?
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 

BigML Education - Feature Engineering with Flatline

  • 2. BigML Education Program 2Flatline In This Video • Introduction • Define Flatline: Feature Engineering & Filtering • Define Feature Engineering: why is it important? • Examples: Automatic Feature Engineering / Filtering • Basic Features • Explore built-in functions for FE / Filtering • Refresh options • Advanced Features • Formally introduce Flatline s-expressions • Lisp-style expressions for FE and filtering. • Examples of custom transformations
  • 3. BigML Education Program 3Flatline Flatline Introduction
  • 4. BigML Education Program 4Flatline What is Flatline? • DSL: • Invented by BigML - Programmatic / Optimized for speed • Transforms datasets into new datasets • Adding new fields / Filtering • Transformations are written in lisp-style syntax • Programmatic Filtering: • Filtering datasets according to functions that evaluate to true/false using the row of data as an input. • Feature Engineering??? Flatline: a domain specific language for feature engineering and programmatic filtering
  • 5. BigML Education Program 5Flatline What is Feature Engineering? • This is really, really important - more than algorithm selection! • ML Algorithms have no deeper understanding of data • Numerical: have a natural order, can be scaled, etc • Categorical: have discrete values, etc. • The "magic" is the ability to find patterns quickly and efficiently • ML Algorithms only know what you tell/show it with data • Medical: Kg and M, but BMI = Kg/M2 is better • Lending: Debt and Income, but DTI is better • Intuition can be risky: remember to prove it with an evaluation! Feature Engineering: applying domain knowledge of the data to create new features that allow ML algorithms to work better, or to work at all.
  • 6. BigML Education Program 6Flatline Automatic Transformations 2013-09-25 10:02 DATE-TIME Date-Time Fields … year month day hour minute … … 2013 Sep 25 10 2 … … … … … … … … NUM NUMCAT NUM NUM • Date-Time fields have a lot of information "packed" into them • Splitting out the time components allows ML algorithms to discover time-based patterns.
  • 7. BigML Education Program 7Flatline Automatic Transformations Categorical Fields for Clustering/LR … alchemy_category … … business … … recreation … … health … … … … CAT business health recreation … … 1 0 0 … … 0 0 1 … … 0 1 0 … … … … … … NUM NUM NUM • Clustering and Logistic Regression require numeric fields for inputs • Categorical values are transformed to numeric vectors automatically* *Note: In BigML, clustering uses k-prototypes and the encoding used for LR can be configured.
  • 8. BigML Education Program 8Flatline Automatic Transformations Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon ‘em. TEXT Text Fields … great afraid born achieve … … 4 1 1 1 … … … … … … … NUM NUM NUM NUM • Unstructured text contains a lot of potentially interesting patterns • Bag-of-words analysis happens automatically and extracts the "interesting" tokens in the text • Another option is Topic Modeling to extract thematic meaning
  • 9. BigML Education Program 9Flatline Help ML to Work Better { “url":"cbsnews", "title":"Breaking News Headlines Business Entertainment World News “, "body":" news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more.” } TEXT title body Breaking News… news covering… … … TEXT TEXT When text is not actually unstructured • In this case, the text field has structure (key/value pairs) • Extracting the structure as new features may allow the ML algorithm to work better
  • 10. BigML Education Program 10Flatline Help ML to Work at all When the pattern does not exist Highway Number Direction Is Long 2 East-West FALSE 4 East-West FALSE 5 North-South TRUE 8 East-West FALSE 10 East-West TRUE … … … Goal: Predict principle direction from highway number ( = (mod (field "Highway Number") 2) 0)
  • 11. BigML Education Program 11Flatline Flatline Basic Features
  • 12. BigML Education Program 12Flatline Built-ins for Filtering • Built-in filtering: • Common filtering options for each field type • Internally the built-in filtering writes a Flatline expression • Numeric Fields: Comparison / Equals / Missing / Statistics • Date-Time: Comparison / Equals / Missing • Categorical Fields: Specific values / Missing values • Text Fields: Equals / Contains / Doesn’t contain / Missing • Refresh Fields: • Types: recomputes field types. Ex: #classes > 1000 • Preferred: recomputes preferred status
  • 13. BigML Education Program 13Flatline Built-ins for Filtering • boilerplate contains "cup" • alchemy_category is not "sports" nor "unknown" • alchemy_category_score >= 0.70 StumbleUpon Filtering Example
  • 14. BigML Education Program 14Flatline Built-ins for FE • Discretize: Converts a numeric value to categorical • Replace missing values: fixed/max/mean/median/etc • Normalize: Adjust a numeric value to a specific range of values while preserving the distribution • Math: Exponentiation, Logarithms, Squares, Roots, etc • Types: Force a field value to categorical, integer, or real • Random: Create random values for introducing noise • Statistics: Mean, Population • Refresh Fields: • Types: recomputes field types. Ex: #classes > 1000 • Preferred: recomputes preferred status
  • 15. BigML Education Program 15Flatline Built-in Functions Discretization Total Spend 7.342,99 304,12 4,56 345,87 8.546,32 NUM “Predict will spend $3,521 with error $1,232” Spend Category Top 33% Bottom 33% Bottom 33% Middle 33% Top 33% CAT “Predict customer will be Top 33% in spending”
  • 16. BigML Education Program 16Flatline Flatline Advanced Features
  • 17. BigML Education Program 17Flatline Flatline • Lisp style syntax: Operators come first • Correct: (+ 1 2) : NOT Correct: (1 + 2) • Dataset Fields are first-class citizens • (field “diabetes pedigree”) • Limited programming language structures • let, cond, if, maps, list operators, */+-, etc. • Built-in transformations • statistics, strings, timestamps, windows
  • 18. BigML Education Program 18Flatline Flatline Add Fields Computing with Existing Features Debt Income 10.134 100.000 85.234 134.000 8.112 21.500 0 45.900 17.534 52.000 NUM NUM (/ (field "Debt") (field "Income")) Debt Income Debt to Income Ratio 0,10 0,64 0,38 0 0,34 NUM
  • 19. BigML Education Program 19Flatline Flatline Filtering • Any flatline expression that evaluates to true/false can be used to filter a dataset • If the expression is true for the row, it will be kept in the output dataset • If the expression is false for the row, it will be left out from the output dataset • Filtering can be direct or sometimes in two steps: 1. Add field to dataset 2. Filter new field with a built-in function
  • 20. BigML Education Program 20Flatline Flatline Filtering Filtering by Computation Debt Income 10.134 100.000 85.234 134.000 8.112 21.500 0 45.900 17.534 52.000 NUM NUM (< (/ (field "Debt") (field "Income")) 0.35) Debt Income < 0.35 Debt Income 10.134 100.000 85.234 134.000 8.112 21.500 0 45.900 17.534 52.000 NUM NUM
  • 21. BigML Education Program 21Flatline Flatline s-expressions (= 0 (+ (abs ( f "Month - 3" ) ) (abs ( f "Month - 2")) (abs ( f "Month - 1") ) )) Name Month - 3 Month - 2 Month - 1 Joe Schmo 123,23 0 0 Jane Plain 0 0 0 Mary Happy 0 55,22 243,33 Tom Thumb 12,34 8,34 14,56 Un-Labelled Data Labelled data Name Month - 3 Month - 2 Month - 1 Default Joe Schmo 123,23 0 0 FALSE Jane Plain 0 0 0 TRUE Mary Happy 0 55,22 243,33 FALSE Tom Thumb 12,34 8,34 14,56 FALSE Adding Simple Labels to Data Define "default" as missing three payments in a row
  • 22. BigML Education Program 22Flatline Flatline s-expressions date volume price 1 34353 314 2 44455 315 3 22333 315 4 52322 321 5 28000 320 6 31254 319 7 56544 323 8 44331 324 9 81111 287 10 65422 294 11 59999 300 12 45556 302 13 19899 301 Current - (4-day avg) std dev Shock: Deviations from a Trend day-4 day-3 day-2 day-1 4davg - 314 - 314 315 - 314 315 315 - 314 315 315 321 316,25 315 315 321 320 317,75 315 321 320 319 318,75
  • 23. BigML Education Program 23Flatline Flatline s-expressions Current - (4-day avg) std dev Shock: Deviations from a Trend Current : (field “price”) 4-day avg: (avg-window “price” -4 -1) std dev: (standard-deviation “price”) (/ (- ( f "price") (avg-window "price" -4, -1)) (standard-deviation "price"))
  • 24. BigML Education Program 24Flatline Advanced s-expressions Moon Phase% ( / ( mod ( - ( / ( epoch ( field "date-field" )) 1000 ) 621300 ) 2551443 ) 2551442 ) Highway isEven? ( = (mod (field "Highway Number") 2) 0) ( let (R 6371000 latA (to-radians {lat-ref}) latB (to-radians ( field "LATITUDE" ) ) latD ( - latB latA ) longD ( to-radians ( - ( field "LONGITUDE" ) {long-ref} ) ) a ( + ( square ( sin ( / latD 2 ) ) ) ( * (cos latA) (cos latB) (square ( sin ( / longD 2))) ) ) c ( * 2 ( asin ( min (list 1 (sqrt a)) ) ) ) ) ( * R c ) ) Distance Lat/Long <=> Ref (Haversine)
  • 25. BigML Education Program 25Flatline WhizzML + Flatline HAVERSINE FLATLINE OUTPUT DATASET INPUT DATASET LONG Ref LAT Ref WHIZZML SCRIPT GALLERY
  • 26. BigML Education Program 26Flatline Summary • Flatline: programmatic feature engineering / filtering • Feature Engineering: what is it / why it is important • Automatic transformations: date-time, text, etc • Built-in functions: filtering and feature engineering • Discretization / Normalization / etc. • Flatline expressions • Structure • Examples: Adding fields / filtering • WhizzML • Wrapping Flatline • Quick start with scriptify