The document provides guidance on building an end-to-end machine learning project to predict California housing prices using census data. It discusses getting real data from open data repositories, framing the problem as a supervised regression task, preparing the data through cleaning, feature engineering, and scaling, selecting and training models, and evaluating on a held-out test set. The project emphasizes best practices like setting aside test data, exploring the data for insights, using pipelines for preprocessing, and techniques like grid search, randomized search, and ensembles to fine-tune models.
3. Working with Real Data
When you are learning about Machine Learning, it is best to actually experiment with real-world data, not just artificial datasets. Fortunately, there are thousands of open datasets to choose from, ranging across all sorts of domains. Here are a few places you can look to get data:
- Popular open data repositories:
- UC Irvine Machine Learning Repository
- Kaggle datasets
- Amazon’s AWS datasets
- Meta portals (they list open data repositories):
- http://dataportals.org/
- http://opendatamonitor.eu/
- http://quandl.com/
4. Working with Real Data
- Other pages listing many popular open data repositories:
- Wikipedia’s list of Machine Learning datasets
- Quora.com question
- Datasets subreddit
5. Look at the Big Picture
The first task you are asked to perform is to build a model of housing prices in California using the California census data. This data has metrics such as the population, median income, and median housing price for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will just call them “districts” for short.
Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.
6. Frame the Problem
Question:
What exactly is the business objective? Building a model is probably not the end goal. How does the company expect to use and benefit from this model?
This is important because it will determine how you frame the problem, which algorithms you will select, what performance measure you will use to evaluate your model, and how much effort you should spend tweaking it.
7. Frame the Problem
Answer:
Your model’s output (a prediction of a district’s median housing price) will be
fed to another Machine Learning system, along with many other signals. This
downstream system will determine whether it is worth investing in a given area
or not. Getting this right is critical, as it directly affects revenue.
8. Frame the Problem
Question:
What does the current solution look like (if any)? It will often give you a reference performance, as well as insights into how to solve the problem.
9. Frame the Problem
Answer:
The district housing prices are currently estimated manually by experts: a team gathers up-to-date information about a district (excluding median housing prices) and uses complex rules to come up with an estimate. This is costly and time-consuming, and their estimates are not great; the typical error rate is about 15%.
10. Frame the Problem
With all this information you are now ready to start designing your system.
First, you need to frame the problem:
Is it supervised, unsupervised, or Reinforcement Learning?
Is it a classification task, a regression task, or something else?
Should you use batch learning or online learning techniques?
11. It is clearly a typical supervised learning task, since you are given labeled training examples (each instance comes with the expected output, i.e., the district’s median housing price). Moreover, it is also a typical regression task, since you are asked to predict a value. More specifically, this is a multivariate regression problem, since the system will use multiple features to make a prediction (the district’s population, the median income, etc.).
Earlier, you predicted life satisfaction based on just one feature, GDP per capita, so that was a univariate regression problem.
Finally, there is no continuous flow of data coming into the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain batch learning should do just fine.
12. Select a Performance Measure
A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It measures the typical size of the errors the system makes in its predictions.
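For reference, RMSE is conventionally defined as follows, where m is the number of instances in the dataset, \mathbf{x}^{(i)} and y^{(i)} are the feature vector and label of the i-th instance, and h is the system’s prediction function:

\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(\mathbf{x}^{(i)}) - y^{(i)} \right)^{2}}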
13. Check the Assumptions
It is good practice to list and verify the assumptions that were made so far (by
you or others); this can catch serious issues early on.
For example, your system outputs district prices, and we assume that these prices are going to be used as such. But what if the downstream system actually converts the prices into categories (e.g., “cheap”, “medium”, or “expensive”) and then uses those categories instead of the prices themselves?
14. Check the Assumptions
In this case, getting the price perfectly right is not important at all; your system just needs to get the category right. If that’s so, then the problem should have been framed as a classification task, not a regression task. You don’t want to find this out after working on a regression system for months.
Fortunately, after talking with the team in charge of the downstream system,
you are confident that they do indeed need the actual prices, not categories.
15. Get the Data
It’s time to get your hands dirty.
Don’t hesitate to pick up your laptop and walk through the following code
examples in a Jupyter notebook.
16. Create the Workspace
1. You need to have Python installed.
2. Create a workspace directory for your Machine Learning code and datasets.
You need a number of Python modules: Jupyter, NumPy, Pandas, Matplotlib
and Scikit-Learn.
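One typical way to install these modules (an assumption; conda or a virtual environment work equally well):

python3 -m pip install --upgrade jupyter numpy pandas matplotlib scikit-learn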
17. Download the Data
In typical environments your data would be available in a relational database
(or some other common datastore) and spread across multiple
tables/documents/files.
To access it, you would first need to get your credentials and access
authorizations, and familiarize yourself with the data schema.
In this project, however, things are much simpler: you will download a single
compressed file, housing.tgz, which contains a comma-separated value (CSV)
file called housing.csv with all the data.
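The slide’s accompanying code is not reproduced in this transcript; a minimal sketch of such a download helper, assuming the archive is hosted at the illustrative URL below, might look like this:

import os
import tarfile
import urllib.request

# Illustrative location of the archive; point this wherever housing.tgz is hosted.
HOUSING_URL = ("https://raw.githubusercontent.com/ageron/handson-ml/"
               "master/datasets/housing/housing.tgz")
HOUSING_PATH = os.path.join("datasets", "housing")

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Download housing.tgz and extract housing.csv into housing_path.
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

fetch_housing_data()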
19. Take a Quick Look at the Data Structure
Let’s take a look at the top five rows using the DataFrame’s head() method
20. Take a Quick Look at the Data Structure
The info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values.
21. Take a Quick Look at the Data Structure
Call the hist() method on the whole dataset, and it will plot a histogram for
each numerical attribute. For example, you can see that slightly over 800
districts have a median_house_value equal to about $500,000.
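A minimal sketch of these first-look steps (assuming housing.csv was extracted to datasets/housing as above):

import os
import pandas as pd
import matplotlib.pyplot as plt

def load_housing_data(housing_path=os.path.join("datasets", "housing")):
    # Read the extracted CSV into a pandas DataFrame.
    return pd.read_csv(os.path.join(housing_path, "housing.csv"))

housing = load_housing_data()
print(housing.head())                     # top five rows
housing.info()                            # row count, attribute types, non-null counts
housing.hist(bins=50, figsize=(20, 15))   # one histogram per numerical attribute
plt.show()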
22. Create a Test Set
It may sound strange to voluntarily set aside part of the data at this stage.
After all, you have only taken a quick glance at the data, and surely you should
learn a whole lot more about it before you decide what algorithms to use,
right?
This is true, but your brain is an amazing pattern detection system, which
means that it is highly prone to overfitting: if you look at the test set, you may
stumble upon some seemingly interesting pattern in the test data that leads
you to select a particular kind of Machine Learning model.
When you estimate the generalization error using the test set, your estimate will be too optimistic and you will launch a system that will not perform as well as expected. This is called data snooping bias.
23. Create a Test Set
Creating a test set is theoretically quite simple: just pick some instances
randomly, typically 20% of the dataset, and set them aside:
24. Create a Test Set
You can then use this function like this:
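The code shown on these slides is not included in this transcript; a minimal sketch of such a split function and its use might be the following (in practice, Scikit-Learn’s train_test_split does the same job, with a random seed for reproducibility):

import numpy as np

def split_train_test(data, test_ratio):
    # Shuffle the row indices, then carve off the first test_ratio fraction.
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")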
25. Discover and Visualize the Data to Gain Insights
First, make sure you have put the test set aside and you are only exploring the
training set. Also, if the training set is very large, you may want to sample an
exploration set, to make manipulations easy and fast. In our case, the set is
quite small so you can just work directly on the full set.
26. Visualizing Geographical Data
Since there is geographical information (latitude and longitude), it is a good
idea to create a scatterplot of all districts to visualize the data.
27. Visualizing Geographical Data
This looks like California all right, but other than that it is hard to see any
particular pattern. Setting the alpha option to 0.1 makes it much easier to
visualize the places where there is a high density of data points.
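A sketch of the scatterplot call, assuming the housing DataFrame loaded earlier:

import matplotlib.pyplot as plt

# A low alpha makes overplotted points translucent, so dense areas stand out.
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
plt.show()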
28. Looking for Correlations
The correlation coefficient ranges from -1 to 1.
When it is close to 1, it means that there is a strong positive correlation. For example, the median house value tends to go up when the median income goes up.
When the coefficient is close to -1, it means that there is a strong negative correlation; you can see a small negative correlation between the latitude and the median house value (i.e., prices have a slight tendency to go down when you go north).
Finally, coefficients close to zero mean that there is no linear correlation.
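A sketch of computing these coefficients with pandas (numeric_only=True skips the ocean_proximity text attribute on recent pandas versions):

# Pearson correlation of every numerical attribute with every other one.
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))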
30. Experimenting with Attribute Combinations
One last thing you may want to do before actually preparing the data for
Machine Learning algorithms is to try out various attribute combinations.
For example, the total number of rooms in a district is not very useful if you
don’t know how many households there are.
What you really want is the number of rooms per household.
Similarly, the total number of bedrooms by itself is not very useful: you
probably want to compare it to the number of rooms. And the population per
household also seems like an interesting attribute combination to look at.
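For example, these derived attributes can be added directly to the DataFrame:

# Ratios are often more informative than raw counts.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]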
31. Prepare the Data for Machine Learning Algorithms
Instead of just doing this manually, you should write functions to do that, for
several good reasons:
1. This will allow you to reproduce these transformations easily on any dataset
(e.g., the next time you get a fresh dataset).
2. You will gradually build a library of transformation functions that you can
reuse in future projects.
3. You can use these functions in your live system to transform the new data
before feeding it to your algorithms.
4. This will make it possible for you to easily try various transformations and see which combination of transformations works best.
32. Data Cleaning
Most Machine Learning algorithms cannot work with missing features, so
let’s create a few functions to take care of them. You noticed earlier that the
total_bedrooms attribute has some missing values, so let’s fix this. You have
three options:
1. Get rid of the corresponding districts.
2. Get rid of the whole attribute.
3. Set the missing values to some value (zero, the mean, the median, etc.).
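Each option maps to roughly one line of pandas, sketched here (Scikit-Learn’s SimpleImputer is the more reusable route for option 3):

# Option 1: drop districts whose total_bedrooms is missing.
housing_opt1 = housing.dropna(subset=["total_bedrooms"])
# Option 2: drop the whole attribute.
housing_opt2 = housing.drop("total_bedrooms", axis=1)
# Option 3: fill the missing values with the median.
median = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)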
33. Handling Text and Categorical Attributes
Earlier we left out the categorical attribute ocean_proximity because it is a text attribute, so we cannot compute its median. Most Machine Learning algorithms prefer to work with numbers anyway.
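One common way to turn this text attribute into numbers is one-hot encoding, sketched here with Scikit-Learn:

from sklearn.preprocessing import OneHotEncoder

# Each category becomes one binary column; the result is a sparse matrix.
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing[["ocean_proximity"]])
print(cat_encoder.categories_)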
34. Custom Transformers
Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. You will want your transformer to work seamlessly with Scikit-Learn functionality (such as pipelines), and since Scikit-Learn relies on duck typing (not inheritance), all you need to do is create a class and implement three methods: fit() (returning self), transform(), and fit_transform().
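A minimal sketch of such a transformer (the column indices are assumptions based on the housing.csv layout; inheriting TransformerMixin supplies fit_transform() for free once fit() and transform() exist):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    # Illustrative transformer: appends a rooms-per-household column.
    def __init__(self, rooms_ix=3, households_ix=6):
        self.rooms_ix = rooms_ix
        self.households_ix = households_ix

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, self.rooms_ix] / X[:, self.households_ix]
        return np.c_[X, rooms_per_household]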
35. Feature Scaling
One of the most important transformations you need to apply to your data is feature scaling.
With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales.
This is the case for the housing data: the total number of rooms ranges from
about 6 to 39,320, while the median incomes only range from 0 to 15. Note
that scaling the target values is generally not required.
There are two common ways to get all attributes to have the same scale:
min-max scaling and standardization.
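A sketch of both approaches with Scikit-Learn, assuming missing values have already been handled:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

housing_num = housing.drop("ocean_proximity", axis=1)  # numerical attributes only
min_max_scaled = MinMaxScaler().fit_transform(housing_num)  # each attribute rescaled to [0, 1]
standardized = StandardScaler().fit_transform(housing_num)  # zero mean, unit variance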
36. Transformation Pipelines
There are many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn provides the Pipeline class to help with such sequences of transformations. Here is a small pipeline for the numerical attributes:
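The slide’s code is not reproduced here; a minimal sketch of such a numerical pipeline might be:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values first
    ("std_scaler", StandardScaler()),               # then standardize
])
housing_num_tr = num_pipeline.fit_transform(housing_num)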
37. Select and Train a Model
At last! You framed the problem, you got the data and explored it, you sampled
a training set and a test set, and you wrote transformation pipelines to clean up
and prepare your data for Machine Learning algorithms automatically. You are
now ready to select and train a Machine Learning model.
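As a sketch, a linear regression makes a reasonable first baseline (housing_labels stands for the median_house_value column, separated out as the target; it is not defined in this transcript):

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_num_tr, housing_labels)  # housing_labels: the target column
print(lin_reg.predict(housing_num_tr[:5]))   # sanity-check a few predictions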
38. Fine-Tune Your Model
Let’s assume that you now have a shortlist of promising models. You now need
to fine-tune them.
39. Grid Search
One way to do that would be to fiddle with the hyperparameters manually,
until you find a great combination of hyperparameter values. This would be
very tedious work, and you may not have time to explore many combinations.
Instead, you should get Scikit-Learn’s GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation.
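A sketch of GridSearchCV on a random forest; the hyperparameter names are real RandomForestRegressor parameters, but the values to try are illustrative:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {"n_estimators": [30, 100], "max_features": [4, 6, 8]},  # illustrative values
]
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(housing_num_tr, housing_labels)
print(grid_search.best_params_)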
40. Randomized Search
This approach has two main benefits:
- If you let the randomized search run for, say, 1,000 iterations, it will explore 1,000 different values for each hyperparameter (instead of just a few values per hyperparameter with the grid search approach).
- You have more control over the computing budget you want to allocate to hyperparameter search, simply by setting the number of iterations.
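The corresponding tool is Scikit-Learn’s RandomizedSearchCV, which samples hyperparameter values from distributions you specify; a sketch:

from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_estimators": randint(1, 200),  # sampled anew at each iteration
    "max_features": randint(1, 8),
}
rnd_search = RandomizedSearchCV(RandomForestRegressor(), param_distribs,
                                n_iter=10, cv=5, scoring="neg_mean_squared_error")
rnd_search.fit(housing_num_tr, housing_labels)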
42. Ensemble Methods
Another way to fine-tune your system is to try to combine the models that perform best. The group (or “ensemble”) will often perform better than the best individual model (just like Random Forests perform better than the individual Decision Trees they rely on), especially if the individual models make very different types of errors.
43. Evaluate Your System on the Test Set
After tweaking your models for a while, you eventually have a system that
performs sufficiently well. Now is the time to evaluate the final model on the
test set. There is nothing special about this process; just get the predictors and
the labels from your test set, run your full_pipeline to transform the data (call
transform(), not fit_transform()!), and evaluate the final model on the test set.
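A sketch of that final evaluation; full_pipeline is the complete preprocessing pipeline referenced on the slide (its definition is not included in this transcript), and grid_search and test_set come from the earlier sketches:

import numpy as np
from sklearn.metrics import mean_squared_error

final_model = grid_search.best_estimator_
X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)  # transform(), never fit_transform(), on test data
final_predictions = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print(final_rmse)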
44. Launch, Monitor and Maintain Your System
Finally, you will generally want to train your models on a regular basis using
fresh data. You should automate this process as much as possible.
If you don’t, you are very likely to refresh your model only every six months (at
best), and your system’s performance may fluctuate severely over time.
If your system is an online learning system, you should make sure you save
snapshots of its state at regular intervals so you can easily roll back to a
previously working state.
45. Thanks!
Does anyone have any questions?
Twitter: @walkercet
Blog: https://ceteongvanness.wordpress.com