Automated Feature Generation
& Amdahl's Law
By:
Simon Ouellette
(www.simonouellette.com)
1. What are machine learning features?
Examples
● Spam detection: presence or absence of certain email
headers, the email structure, the frequency of specific
terms, etc.
● Time-series analysis: lags of a variable, differentials of a
variable, polynomial forms, logarithms, etc.
● Computer vision: pixel color histograms, edges, etc.
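As a rough illustration of the time-series bullet above, the sketch below derives lag, first-difference, and logarithm features from a raw series. The function name `make_features`, the `max_lag` parameter, and the sample values are hypothetical, chosen only for this example; they do not come from the talk.

```python
import math

def make_features(series, max_lag=2):
    """Build lag, first-difference, and log features for each usable time step.

    Assumes max_lag >= 1 and strictly positive values (needed for the log).
    """
    if max_lag < 1:
        raise ValueError("max_lag must be at least 1")
    rows = []
    # Start at max_lag so that every requested lag exists for the row.
    for t in range(max_lag, len(series)):
        row = {"value": series[t]}
        for k in range(1, max_lag + 1):
            row[f"lag_{k}"] = series[t - k]           # value k steps earlier
        row["diff_1"] = series[t] - series[t - 1]      # first difference
        row["log"] = math.log(series[t])               # log transform
        rows.append(row)
    return rows

# Hypothetical sample data.
series = [10.0, 12.0, 15.0, 14.0]
features = make_features(series, max_lag=2)
for row in features:
    print(row)
```

Each generated column is one candidate feature; the first `max_lag` observations are dropped because their lags are undefined.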
Feature generation
● This step is generally the most expensive/difficult
in the machine learning "pipeline"
● No science to it, it's a black art (not a deductive
process, but an inductive one)
● Relies on a combination of domain knowledge
and trial and error.
2. Can we automate feature discovery?
The discovery process
● For the moment let's abstract out the domain
knowledge half of the equation.
● It boils down to a search/optimization problem:
find the optimal multi-dimensional point in the
parameter space.
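A minimal sketch of feature discovery framed as a search, assuming a tiny hand-picked set of candidate transformations and absolute Pearson correlation with the target as the score. Both choices are assumptions made for illustration, not the method used in the talk, and the names `CANDIDATES` and `best_transform` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(vx * vy)

# Hypothetical, deliberately small search space of transformations.
CANDIDATES = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
    "log": lambda x: math.log(x),    # assumes x > 0
    "sqrt": lambda x: math.sqrt(x),  # assumes x >= 0
}

def best_transform(xs, ys, candidates=CANDIDATES):
    """Score every candidate and return the best name plus all scores."""
    scores = {
        name: abs(pearson([f(x) for x in xs], ys))
        for name, f in candidates.items()
    }
    return max(scores, key=scores.get), scores

# Synthetic data where the target is the square of the input.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x * x for x in xs]
best, scores = best_transform(xs, ys)
print(best, scores)
```

An exhaustive loop like this only works because the candidate set is tiny; a realistic space of transformations needs a smarter search strategy.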
Parameter space
Theory vs Practice: a disclaimer
● In theory, the parameter space is infinite-
dimensional (there's an infinity of
transformations you can apply to the raw input
variables!)
● In practice, it's finite, but you can still never claim
to have searched the entire space of possible
solutions.
● Bottom line: utility. Is automation a useful, time-
saving tool for data scientists?
Domain knowledge
● D.K. (domain knowledge) is a significant optimization that
humans use, and one that is difficult for a machine to exploit.
● Semantic Web and knowledge ontologies:
● User-specified domain knowledge
– Variable type (quantitative, categorical, text,
entity, etc.)
– Time series or cross-sectional dataset
– Concept inheritance
– Known relationships / features of significance
● Access to public repositories of D.K.
3. Nectarine Demo
4. Challenges in parallelism
Amdahl's law
My encounter with Amdahl's law
● Ran my feature discovery algorithm on 100
million rows of data, and put it on the cloud to
make full use of parallelization.
● Quickly discovered that my algorithm didn't scale
beyond 8 cores. Adding cores didn't improve
performance.
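Amdahl's law explains this plateau: if a fraction p of the work can be parallelized, the speedup on n cores is 1 / ((1 - p) + p / n), which can never exceed 1 / (1 - p). The sketch below evaluates that formula. The value p = 0.9 is an assumed figure for illustration only, not a measurement from the run described above.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Theoretical speedup under Amdahl's law.

    parallel_fraction: share of the runtime that can be parallelized (0..1).
    cores: number of cores working on the parallel share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Assumed parallel fraction, for illustration only.
P = 0.9
for cores in (1, 8, 64, 512):
    print(cores, round(amdahl_speedup(P, cores), 2))
```

With 10% of the work serial, the speedup stays below 10 no matter how many cores are added, and most of the gain is already reached with a modest core count.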
Cause: low-level parallelism
● A lot more overhead (shuffles, etc.)
● Control flow often implies collecting
intermediate results to branch on them
(.count(), .reduce(), .sum(), .first(), etc.) – hence
non-parallel bottlenecks
● Every line of code not in a Spark closure is a non-
parallel bottleneck! (compare with
mapPartitions() approach of high-level
parallelism)
LASSOWithSGD
● LASSOWithSGD: SGD is Stochastic Gradient
Descent (an optimization algorithm). The cost
function is the non-parallelizable section.
● Gradient Descent is an iterative algorithm that
finds the minimum point (i.e. minimum model
error) on the surface of model error as a function
of the parameter values.
LASSOWithSGD
● Calculating the gradients of the points can be done in parallel.
● But you have to reduce to a local sum, which implies serializing,
transferring to the driver, and deserializing on the driver. Then you
have to update the weights, serialize them, send them back to the
executors, and move on to the next step.
● This is what happens, as a general rule, with low-level parallelism.
You have to aggregate results locally to proceed to the next
iteration. This is the non-parallelizable section of the code.
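The per-iteration round trip can be sketched without Spark. In the simulation below, each inner list stands in for one executor's partition, and the loop body plays the role of the driver. It is plain batch gradient descent on a one-parameter model y ≈ w * x, with no L1 penalty and no sampling, so it illustrates only the synchronization pattern, not LASSOWithSGD itself. All names and data are hypothetical.

```python
def partition_gradient(partition, w):
    """Work done 'on an executor': summed squared-error gradient for y ~ w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in partition)

def fit_low_level(partitions, steps=50, learning_rate=0.01):
    """Gradient descent with one driver-side aggregation per iteration."""
    n = sum(len(p) for p in partitions)
    w = 0.0
    sync_points = 0
    for _ in range(steps):
        # Parallelizable part: each partition computes its own partial sum.
        partial_sums = [partition_gradient(p, w) for p in partitions]
        # Serial part: the driver collects the partial sums, updates the
        # weight, and must ship the new weight out before the next step.
        gradient = sum(partial_sums) / n
        w -= learning_rate * gradient
        sync_points += 1
    return w, sync_points

# Hypothetical data following y = 2 * x, split into two partitions.
partitions = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w, sync_points = fit_low_level(partitions)
print(w, sync_points)
```

The number of synchronization points equals the number of iterations, so the serial share of the runtime grows with every step the optimizer takes.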
Solution: High-level parallelism
● Changed my machine learning libs to traditional
single-threaded ones (used SMILE)
● Instead of parallelizing inside each algorithm, I'm
putting everything in a mapPartitions() closure.
● The data is broadcast to all executors.
● No shuffle, only 1 local aggregation at the end
of the analysis. Highly parallelizable.
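A minimal sketch of the same idea outside Spark: each worker receives a slice of the candidate features, runs a complete single-threaded fit for each one locally, and the driver aggregates only once at the end. `ThreadPoolExecutor` is a stand-in for Spark executors (it demonstrates the structure, not real speedup in CPython), the shared `data` dictionary plays the role of the broadcast dataset, and `fit_slope`, `evaluate_partition`, and `search` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def fit_slope(xs, ys):
    """Single-threaded least-squares slope for a line through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def evaluate_partition(candidate_names, data, target):
    """Analogue of a mapPartitions() closure: whole fits run locally."""
    results = []
    for name in candidate_names:
        xs = data[name]
        w = fit_slope(xs, target)
        squared_error = sum((w * x - y) ** 2 for x, y in zip(xs, target))
        results.append((name, squared_error))
    return results

def search(data, target, n_partitions=2):
    """Split candidates across workers, then aggregate exactly once."""
    names = sorted(data)
    chunks = [names[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        parts = list(pool.map(
            lambda chunk: evaluate_partition(chunk, data, target), chunks))
    # The only aggregation step: pick the lowest-error candidate overall.
    return min((r for part in parts for r in part), key=lambda r: r[1])

# Hypothetical 'broadcast' dataset: every worker reads the same columns.
target = [2.0, 4.0, 6.0, 8.0]
data = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "x_squared": [1.0, 4.0, 9.0, 16.0],
    "constant": [1.0, 1.0, 1.0, 1.0],
}
best_name, best_error = search(data, target)
print(best_name, best_error)
```

Because no worker needs another worker's intermediate results, there is no per-iteration round trip to the driver; the trade-off is that every worker must hold the full dataset.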
Conclusions
● High-level parallelism limitation: the dataset must
fit inside an executor's memory! (So not Big
Data! A few gigabytes at most!)
● MLlib is for real Big Data (terabytes and up).
● Even if your feature space is Big Data, you're
probably better off without MLlib.
Recruiting!
● Data scientists and developers
● Full-time, part-time, sweat equity, etc.
