A presentation at the Montreal Apache Spark Meetup about automatic feature generation and some parallelism challenges related to Amdahl's Law and MLlib's implementation of LASSO regression.
5. Examples
● Spam detection: presence or absence of certain email
headers, the email structure, the frequency of specific
terms, etc.
● Time-series analysis: lags of a variable, differentials of a
variable, polynomial forms, logarithms, etc.
● Computer vision: pixel color histograms, edges, etc.
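The time-series bullet above can be sketched in a few lines of plain Scala — a minimal illustration of generating lag, difference, and log features from a raw series (all names here are illustrative, not from the talk):

```scala
// Minimal sketch: deriving time-series features (lags, first differences,
// log transforms) from a raw series. Early positions where the feature is
// undefined are represented as None.
object TimeSeriesFeatures {
  // Lag by k steps: feature(t) = x(t - k)
  def lag(xs: Vector[Double], k: Int): Vector[Option[Double]] =
    xs.indices.map(i => if (i >= k) Some(xs(i - k)) else None).toVector

  // First difference: x(t) - x(t - 1)
  def diff(xs: Vector[Double]): Vector[Option[Double]] =
    xs.indices.map(i => if (i >= 1) Some(xs(i) - xs(i - 1)) else None).toVector

  // Log transform (only defined for positive values)
  def log(xs: Vector[Double]): Vector[Option[Double]] =
    xs.map(x => if (x > 0) Some(math.log(x)) else None)
}
```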
6. Feature generation
● This step is generally the most expensive/difficult
in the machine learning "pipeline"
● There's no science to it; it's a black art (not a
deductive process, but an inductive one)
● Relies on a combination of domain knowledge
and trial and error.
8. The discovery process
● For the moment let's abstract out the domain
knowledge half of the equation.
● It boils down to a search/optimization problem:
find optimum multi-dimensional point in the
parameter space.
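As a toy illustration of "feature discovery as search", here is a sketch that enumerates candidate transformations of a raw variable and keeps the one that best fits the target (scored by the squared error of a one-parameter least-squares fit; the candidate set and names are illustrative assumptions):

```scala
// Sketch: feature discovery as a search/optimization problem over a space
// of candidate transformations.
object FeatureSearch {
  val candidates: Map[String, Double => Double] = Map(
    "identity" -> ((x: Double) => x),
    "square"   -> ((x: Double) => x * x),
    "sqrt"     -> ((x: Double) => math.sqrt(x))
  )

  // Squared error of the best scalar beta fitting y ≈ beta * f(x)
  private def score(fx: Seq[Double], y: Seq[Double]): Double = {
    val beta = fx.zip(y).map { case (a, b) => a * b }.sum / fx.map(a => a * a).sum
    fx.zip(y).map { case (a, b) => math.pow(b - beta * a, 2) }.sum
  }

  // Exhaustive search: evaluate every candidate, keep the lowest error
  def bestTransform(x: Seq[Double], y: Seq[Double]): String =
    candidates.minBy { case (_, f) => score(x.map(f), y) }._1
}
```

In practice the candidate space is vastly larger (compositions, interactions, lags), which is exactly why it can never be searched exhaustively.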
10. Theory vs Practice: a disclaimer
● In theory, the parameter space is infinite-
dimensional (there's an infinity of
transformations you can apply to the raw input
variables!)
● In practice, it's finite, but you can still never claim
to have searched the entire space of possible
solutions.
● Bottom line: utility. Is automation a useful,
time-saving tool for data scientists?
11. Domain knowledge
● Domain knowledge is a significant optimization
that humans use, and one that is difficult for a
machine to exploit.
● Semantic Web and knowledge ontologies:
● User-specified domain knowledge
– Variable type (quantitative, categorical, text,
entity, etc.)
– Time series or cross-sectional dataset
– Concept inheritance
– Known relationships / features of significance
● Access to public repositories of d.k.
15. My encounter with Amdahl's law
● Ran my feature discovery algorithm on 100
million rows of data, and put it on the cloud to
make full use of parallelization.
● Quickly discovered that my algorithm didn't scale
beyond 8 cores. Adding cores didn't improve
performance.
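That plateau is exactly what Amdahl's law predicts: if only a fraction p of the work parallelizes, speedup on n cores is capped at 1 / ((1 − p) + p / n). A minimal sketch (the 90% figure below is an illustrative assumption, not a measurement from the talk):

```scala
// Amdahl's law: best possible speedup when fraction p of the work is
// parallel and runs on n cores; the serial fraction (1 - p) caps it.
object Amdahl {
  def speedup(p: Double, n: Int): Double = 1.0 / ((1.0 - p) + p / n)
}

// With 90% of the work parallel, 8 cores already give most of the benefit,
// and no number of cores can ever exceed 1 / (1 - 0.9) = 10x.
```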
18. Cause: low-level parallelism
● A lot more overhead, shuffling, etc.
● Control flow often implies collecting
intermediate results to branch on them
(.count(), .reduce(), .sum(), .first(), etc.) – hence
non-parallel bottlenecks
● Every line of code not in a Spark closure is a
non-parallel bottleneck! (compare with the
mapPartitions() approach of high-level
parallelism)
19. LASSOWithSGD
● LASSOWithSGD: SGD is Stochastic Gradient
Descent (an optimization algorithm). The
per-iteration aggregation of the cost/gradient is
the non-parallelizable section.
● Gradient Descent is an iterative algorithm that
finds the point of minimum model error in the
space of parameter values.
21. LASSOWithSGD
● Calculating the gradients of the points can be done in parallel.
● But you have to reduce to a local sum, which implies serializing,
transferring to the driver, and deserializing on the driver. Then you
have to update the weights, serialize them, send them back to the
executors, and move on to the next step.
● This is what happens, as a general rule, with low-level parallelism:
you have to aggregate results locally to proceed to the next
iteration. This is the non-parallelizable section of the code.
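The per-iteration pattern described above can be simulated with plain Scala collections standing in for Spark (illustrative, single machine — not MLlib's actual code). Each iteration maps gradients per "partition" in a parallelizable step, then reduces to one sum on the "driver", where the weight is updated sequentially; that sequential tail is the Amdahl bottleneck:

```scala
// Sketch of gradient descent for 1-D least squares: minimize
// sum((y - w * x)^2); the gradient wrt w is -2 * x * (y - w * x).
object SgdPattern {
  def fit(partitions: Seq[Seq[(Double, Double)]], steps: Int, lr: Double): Double = {
    var w = 0.0
    for (_ <- 1 to steps) {
      // parallelizable: per-partition partial gradients (a map step)
      val partials = partitions.map(_.map { case (x, y) => -2.0 * x * (y - w * x) }.sum)
      // non-parallelizable: reduce to the driver, update the weight,
      // then broadcast the new weight before the next iteration
      val grad = partials.sum / partitions.map(_.size).sum
      w -= lr * grad
    }
    w
  }
}
```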
22. Solution: High-level parallelism
● Switched my machine learning libs to traditional
single-threaded ones (used SMILE)
● Instead I'm putting everything in a
mapPartitions() closure.
● The data is broadcast to all executors.
● No shuffle, only one local aggregation at the end
of the analysis. Highly parallelizable.
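The layout can be sketched without Spark, with a plain .map standing in for mapPartitions() (illustrative names; a real version would broadcast the dataset and use SMILE's fitters inside the closure). Each "partition" holds a slice of the candidate-feature space, a complete single-threaded model is fit inside the closure, and the only aggregation is one reduce at the very end:

```scala
// Sketch of high-level parallelism: independent full fits per partition,
// one final aggregation, no per-iteration synchronization.
object HighLevelParallelism {
  // Closed-form 1-D least squares through the origin: w = sum(x*y)/sum(x*x);
  // returns the residual squared error of feature f on the data
  def fitError(data: Seq[(Double, Double)], f: Double => Double): Double = {
    val fx = data.map { case (x, _) => f(x) }
    val y  = data.map(_._2)
    val w  = fx.zip(y).map { case (a, b) => a * b }.sum / fx.map(a => a * a).sum
    fx.zip(y).map { case (a, b) => math.pow(b - w * a, 2) }.sum
  }

  def bestFeature(data: Seq[(Double, Double)],
                  partitions: Seq[Seq[(String, Double => Double)]]): String = {
    // In Spark this would be mapPartitions() over the candidate space,
    // with `data` broadcast; plain .map stands in for it here
    val perPartitionBest = partitions.map { part =>
      part.map { case (name, f) => (name, fitError(data, f)) }.minBy(_._2)
    }
    // The only aggregation: one local reduce at the end
    perPartitionBest.minBy(_._2)._1
  }
}
```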
24. Conclusions
● High-level parallelism limitation: the dataset must
fit inside an executor's memory! (so not Big
Data: a few gigabytes at most!)
● MLlib is for real Big Data (terabytes+).
● Even if your feature space is Big Data, you're
probably better off without MLlib.