Factors that make a great Machine Learning training data set

Building a training data set drives the quality of the overall machine learning model. If you are looking for some high quality data sources to build your training data sets, then read on to explore some of the useful options.


  1. Machine Learning (ML) is a process of induction.
  2. ML programs rely heavily on training datasets to train their models and ‘learn’ better.
  3. Training data is used to fit the ML program to a particular type of model.
  4. Once the model is trained, it is run on actual data that it hasn’t been trained on: the test dataset.
  5. The test dataset is therefore the held-out data used to evaluate the ML program that was trained on the training dataset.
  6. Both training and test datasets should be representative samples of the population. This ensures that the outcomes generalize to the population the samples were drawn from.
  7. If you are looking for some high-quality data sources to build your training datasets, read on to explore some of the useful options.
  8. Data sources:
      1. UCI Machine Learning Repository
      2. Iris by UCI
      3. Kaggle
      4. Andbrain
      5. ML Bench by R
      6. MIAS
      7. Autonlab
      8. Mulan
      9. DataStock by PromptCloud
  9. What factors should be considered when building training datasets? We’ve listed some of them below.
  10. Factor 1: The right quantity. You need to have answers ready for these basic questions about the quantity of data:
      • The number of records to take from the databases
      • The size of the sample needed to yield the expected performance outcomes
      • The split of data for training and testing, or an alternative approach such as k-fold cross-validation
  11. Jeff Dean, the head of the Google Brain project, stated that deep learning takes at least 100,000 examples.
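The k-fold cross-validation mentioned under the quantity questions can be sketched as follows. This is a minimal, standard-library illustration rather than a reference implementation; the function name `k_fold_indices` is hypothetical, and it assumes the records have already been shuffled if their order is not random.

```python
def k_fold_indices(n_records, k):
    """Yield (train_indices, test_indices) once for each of k folds."""
    if not 2 <= k <= n_records:
        raise ValueError("k must be between 2 and n_records")
    indices = list(range(n_records))
    # Spread the remainder so that fold sizes differ by at most one record.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Every record is used for testing exactly once across the folds.
folds = list(k_fold_indices(10, 3))
```

Each fold trains on roughly (k-1)/k of the data and tests on the rest, which is why this approach suits datasets too small to spare a fixed test set.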
  12. Factor 2: The approach to splitting data
      • You need data to build the model, and you need data to test the model.
      • There should be a method to split the dataset into these two portions. You can go for a random split or a time-based split.
      • In the latter, the general rule of thumb is that older data is for training and newer data is for testing.
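The two splitting methods above could look roughly like this. It is a sketch under stated assumptions: the function names, the `timestamp` field, and the 20% test fraction are all illustrative choices, not something the slides prescribe.

```python
import random


def random_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records, then cut off a test portion."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def time_based_split(records, test_fraction=0.2, key=lambda r: r["timestamp"]):
    """Older records go to training, the newest ones to testing."""
    ordered = sorted(records, key=key)
    cut = len(ordered) - round(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical records with an integer timestamp.
records = [{"timestamp": t, "value": t * 10} for t in [5, 1, 4, 2, 3, 9, 7, 8, 6, 10]]
train, test = time_based_split(records)
rand_train, rand_test = random_split(records)
```

With the time-based split, no test record is older than any training record, so the evaluation mimics predicting the future from the past.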
  13. Some datasets need other approaches, such as stratified sampling or cluster sampling. If you really aren’t sure, run a small pilot to validate your model and then roll it out in full.
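As an illustration of stratified sampling, the sketch below splits each class separately so that the train and test sets keep the same class proportions. The function name `stratified_split` and the example labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(records, label_of, test_fraction=0.2, seed=0):
    """Split each label group separately to preserve class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for record in records:
        by_label[label_of(record)].append(record)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# An imbalanced toy dataset: 80 records of class "a", 20 of class "b".
records = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
train, test = stratified_split(records, label_of=lambda r: r[0])
```

A plain random split of the same data could, by chance, leave very few "b" records in the test set; stratifying avoids that.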
  14. Factor 3: The past
      • You can look at studies that tackled problems similar to yours and reuse their data to make model building more effective.
      • If you are fortunate enough to find a large number of similar past studies, you can average over them to guide your own dataset.
  15. Factor 4: Domain expertise
      • Typically, the samples you feed in need two key qualities: independence and identical distribution.
      • To determine the quality of the data, have a subject matter expert review it.
      • The expert can also help simulate data that you don’t currently have but wish to use to train the machine learning program.
  16. Factor 5: The right kind of data transformation
      • Once you have cleaned and processed the data, you can transform it based on your machine learning training objectives.
      • This feature engineering step transforms the data into the form best suited to a particular type of analysis.
  17. Feature engineering can comprise one or more of the following data transformation processes.
  18. Scaling
      • A processed dataset will normally have attributes on a variety of scales and units, such as weight (kilograms or pounds), distance (kilometers or miles), or currency (dollars or euros).
      • You will need to reduce the variation in scale across attributes to get much better results.
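Two common ways to reduce scale differences are min-max scaling and standardization (z-scores). The slides do not name a specific method, so the sketch below is only one reasonable reading; the function names and sample values are illustrative.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant column carries no scale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Two attributes measured on very different scales.
weights_kg = [50.0, 60.0, 70.0, 80.0, 90.0]
distances_km = [1.0, 5.0, 10.0, 50.0, 100.0]

scaled_weights = min_max_scale(weights_kg)
z_distances = standardize(distances_km)
```

After scaling, both attributes occupy comparable ranges, so neither dominates a distance-based or gradient-based algorithm purely because of its unit.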
  19. Decomposition
      • With functional decomposition, a complex variable can be split into its more granular constituent parts.
      • These constituent parts may have inherent properties or characteristics that help the overall model building process.
      • It helps to separate the ‘noise’ from the components you are actually interested in when building the training datasets.
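One everyday instance of decomposing a complex variable is splitting a timestamp into parts that a model can use directly. This example is an assumption on our part (the slides do not mention timestamps), and the function and feature names are hypothetical.

```python
from datetime import datetime


def decompose_timestamp(ts):
    """Split an ISO 8601 timestamp string into simpler features."""
    dt = datetime.fromisoformat(ts)
    return {
        "year": dt.year,
        "month": dt.month,
        "day_of_week": dt.weekday(),  # Monday is 0, Sunday is 6
        "hour": dt.hour,
        "is_weekend": dt.weekday() >= 5,
    }


features = decompose_timestamp("2021-03-06T14:30:00")
```

A raw timestamp is nearly unique per record, whereas parts such as the hour or the weekend flag recur and can carry signal.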
  20. The way a Bayesian network splits a joint distribution along its causal fault lines is a classic example of decomposition at work.
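In symbols, a Bayesian network factorizes the joint distribution into one conditional term per variable, each conditioned only on that variable's parents in the graph. The second line is an illustrative three-variable case (rain R, sprinkler S, wet grass W) that assumes rain influences the sprinkler and both influence the grass; it is not taken from the slides.

```latex
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)

P(R, S, W) = P(R)\, P(S \mid R)\, P(W \mid R, S)
```

Here \mathrm{pa}(x_i) denotes the set of parents of x_i in the network.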
  21. Aggregation
      • It combines multiple variables with similar attributes into a single larger entity.
      • For some machine learning datasets, this may be a more sensible way to build the dataset for solving a particular problem.
  22. For example, aggregate survey responses can be tracked, rather than individual responses, to solve a particular problem through machine learning.
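The survey example could be sketched like this: individual responses are grouped by question and replaced by a count and a mean score. The field names and the response data are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_responses(responses):
    """Collapse individual survey responses into per-question summaries."""
    scores = defaultdict(list)
    for response in responses:
        scores[response["question"]].append(response["score"])
    return {
        question: {"count": len(values), "mean_score": mean(values)}
        for question, values in scores.items()
    }


# Hypothetical individual responses on a 1-5 scale.
responses = [
    {"respondent": 1, "question": "ease_of_use", "score": 4},
    {"respondent": 2, "question": "ease_of_use", "score": 5},
    {"respondent": 3, "question": "ease_of_use", "score": 3},
    {"respondent": 1, "question": "support", "score": 2},
    {"respondent": 2, "question": "support", "score": 4},
]
summary = aggregate_responses(responses)
```

The aggregated rows, one per question, then become the training records instead of the raw per-respondent rows.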
  23. Identifying the type of algorithm in development
      • Knowing what type of algorithm you are developing lets you better assess the type and quantity of data needed for the training dataset.
      • You can go for a linear or a non-linear algorithm.
  24. • Typically, non-linear algorithms are considered more powerful. They can learn non-linear relationships between the input and output features.
      • They can figure out not only how many parameters are required but also what values these parameters should take to best solve a specific machine learning problem.
  25. • This also means that a non-linear algorithm needs a much larger volume of training data to grasp the complex connections and relationships between the entities being analyzed.
      • Most of the better-known enterprises are interested in algorithms of this kind, which keep improving as more and more data is fed into their systems.
  26. Identifying correctly ‘if’ and ‘when’ big data is required
      • When you're building a training dataset, you need to assess carefully whether big data (a very high volume of data) is needed at all.
      • If so, decide at what point in the dataset creation the big data should be brought in.
  27. • A classic example: in traditional predictive modeling, you may reach a point of diminishing returns where the gains no longer correspond to the amount of data you have added. You may need far more data to overcome this barrier.
      • By carefully assessing your chosen model and the specific problem at hand, you can figure out when this point will arrive and when you would need a much bigger volume of data.
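One simple way to look for the point of diminishing returns is to train on increasing amounts of data, record a validation score each time, and find where the gains flatten out. The sketch below assumes you already have such measurements; the function name, the `min_gain` threshold, and the numbers are all hypothetical.

```python
def diminishing_returns_point(sizes, scores, min_gain=0.01):
    """Return the first training size after which every further step
    improves the score by less than min_gain, or None if there is none."""
    if len(sizes) != len(scores):
        raise ValueError("sizes and scores must have the same length")
    for i in range(len(sizes) - 1):
        later_gains = [scores[j + 1] - scores[j] for j in range(i, len(scores) - 1)]
        if all(gain < min_gain for gain in later_gains):
            return sizes[i]
    return None


# Made-up learning-curve measurements: training size vs. validation accuracy.
sizes = [1000, 2000, 4000, 8000, 16000]
scores = [0.70, 0.78, 0.83, 0.835, 0.838]
plateau = diminishing_returns_point(sizes, scores)
```

In this made-up curve the gains drop below one percentage point after 4,000 records, which is where it would be worth asking whether more data, or a different model, is the better investment.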
  28. To conclude
      • Building a training dataset drives the quality of the overall machine learning model.
      • With these factors in mind, you can build a high-performance training dataset and reap the benefit of a robust, meaningful, and accurate machine learning model that has ‘learnt’ from it.
  29. Are you looking to acquire web data for your business? Let us know your requirements at sales@promptcloud.com
