The document discusses different techniques for cross-validation in machine learning. It defines cross-validation as a technique for validating model efficiency by training on a subset of data and testing on an unseen subset. It then describes various cross-validation methods, such as hold-out validation, k-fold cross-validation, and leave-one-out cross-validation, along with their implementation in scikit-learn.
K-Folds cross-validation is one method that attempts to maximize the use of the available data for training and then testing a model. It is particularly useful for assessing model performance, as it provides a range of accuracy scores across (somewhat) different data sets.
2. • Cross-validation is a technique for validating model efficiency by training it on a subset of the input data and testing it on a previously unseen subset of the input data. We can also say that it is a technique to check how well a statistical model generalizes to an independent dataset.
• In machine learning there is always a need to test the stability of a model: we cannot judge its performance using only the dataset it was trained on. For this purpose, we reserve a particular sample of the dataset that was not part of the training data. We then test the model on that sample before deployment, and this complete process comes under cross-validation. This goes beyond a single, general train-test split.
3. • Hence the basic steps of cross-validation are:
• Reserve a subset of the dataset as a validation set.
• Train the model using the training dataset.
• Evaluate model performance using the validation set. If the model performs well on the validation set, proceed to the next step; otherwise, check for issues.
4. Key aspects of evaluating the quality of a model are:
• How accurate the model is
• How well the model generalizes
• When we build a model and train it on the 'entire' dataset, we can easily calculate its accuracy on that training set. But we cannot test how the model will behave on new data that is not present in the training set, so its generalization cannot be determined.
• Hence we need techniques that make use of the same dataset for both training and testing of models.
• In machine learning, cross-validation is the technique used to evaluate how well a model has generalized, as well as its overall accuracy. For this purpose, it randomly samples data from the dataset to create training and testing sets. There are multiple cross-validation approaches, as follows:
• 1. Hold-Out Approach
• 2. Leave-One-Out Cross-Validation
• 3. K-Fold Cross-Validation
• 4. Stratified K-Fold Cross-Validation
• 5. Repeated Random Train-Test Split
5. • 1. Hold-Out Approach
• In the hold-out approach, the dataset is split into a train set and a test set with random sampling. The train set is used to train the model and the test set is used to test its accuracy on unseen data. If the training accuracy and test accuracy are almost the same, the model is said to have generalized well. It is common to use 80% of the data for training and the remaining 20% for testing.
• Advantages
• It is simple and easy to implement.
• The execution time is low.
• Disadvantages
• If the dataset itself is small, setting aside a portion for testing reduces the robustness of the model, because the training sample may not be representative of the entire dataset.
• The evaluation metrics may vary due to the randomness of the split between the train and test sets.
• Although an 80-20 train-test split is widely followed, there is no rule of thumb for the split, and hence the results can vary based on how the split is done.
6. • 2. Leave-One-Out Cross-Validation (LOOCV)
• In this technique, if there are n observations in the dataset, only one observation is reserved for testing and the remaining data points are used for training. This is repeated n times, so that every data point is used for testing in exactly one iteration. Finally, the average accuracy is calculated by combining the accuracies of all iterations.
• Advantages
• Since every data point participates in both training and testing, the overall accuracy estimate is more reliable.
• It is very useful when the dataset is small.
• Disadvantages
• LOOCV is not practical when the number of observations n is huge. For example, for a dataset with 500,000 records, 500,000 models need to be trained, which is not really feasible.
• There is a huge computational and time cost associated with the LOOCV approach.
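As an illustration (not part of the original slides), LOOCV can be run with scikit-learn's LeaveOneOut splitter passed to cross_val_score. The sketch below uses an assumed synthetic dataset, since the Parkinson's disease data is only introduced further down.

# Minimal LOOCV sketch; the synthetic dataset is an assumption for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
model = DecisionTreeClassifier(random_state=0)
loo = LeaveOneOut()                      # one observation held out per iteration
scores = cross_val_score(model, X, y, cv=loo)
print(len(scores))                       # 100 iterations, one per observation
print(np.mean(scores))                   # average accuracy across all iterations

Note that each individual score here is either 0 or 1, since each test set contains a single observation; only the mean is meaningful.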
7. • 3. K-Fold Cross-Validation
• In the k-fold cross-validation approach, the dataset is split into K folds. In the 1st iteration, the first fold is reserved for testing and the model is trained on the data of the remaining K-1 folds.
• In the next iteration, the second fold is reserved for testing and the remaining folds are used for training. This continues until the K-th iteration. The accuracies obtained in each iteration are used to derive the overall average accuracy of the model.
• Advantages
• K-fold cross-validation is useful when the dataset is small and splitting it into a train-test set (hold-out approach) is not possible without losing useful training data.
• It helps to create a robust model with low variance and low bias, as it is trained on all of the data.
• Disadvantages
• The major disadvantage of k-fold cross-validation is that training needs to be done K times, so it consumes more time and resources.
• It is not recommended for sequential time-series data.
• When the dataset is imbalanced, k-fold cross-validation may not give good results, because some folds may have just a few or no records of the minority class.
8. • 4. Stratified K-Fold Cross-Validation
• Stratified k-fold cross-validation is useful when the data is imbalanced. While sampling data into K folds, it makes sure that the class distribution is maintained in each fold. For example, if 98% of the data belongs to class B and 2% to class A, stratified sampling makes sure each fold contains the two classes in the same 98-to-2 ratio.
• Advantage
• Stratified k-fold cross-validation is recommended when the dataset is imbalanced.
9. • 5. Repeated Random Train-Test Split
• The repeated random train-test split is a hybrid of the traditional train-test split and the k-fold cross-validation method. In this technique, we create a random split of the data into a training set and a test set, and then repeat this process multiple times, just as in the cross-validation method.
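In scikit-learn, this strategy corresponds to the ShuffleSplit splitter (not shown in the original slides); below is a minimal sketch, again on an assumed synthetic dataset.

# Repeated random train-test split via ShuffleSplit; the dataset is assumed for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
model = DecisionTreeClassifier(random_state=0)
shuffle_split = ShuffleSplit(n_splits=10, test_size=0.30, random_state=4)  # 10 independent random 70/30 splits
scores = cross_val_score(model, X, y, cv=shuffle_split)
print(scores)
print(np.mean(scores))

Unlike k-fold, the random splits may overlap, so some observations can appear in several test sets while others appear in none.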
10. Examples of Cross-Validation in the Sklearn Library
• About the Dataset
• We will be using the Parkinson's disease dataset for all examples of cross-validation in the Sklearn library. The goal is to predict whether or not a particular patient has Parkinson's disease. We will be using the decision tree algorithm in all the examples.
• The dataset has 21 attributes and 195 rows. The various fields of the Parkinson's disease dataset are as follows:
• MDVP:Fo(Hz) – Average vocal fundamental frequency
• MDVP:Fhi(Hz) – Maximum vocal fundamental frequency
• MDVP:Flo(Hz) – Minimum vocal fundamental frequency
• MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP – Several measures of variation in fundamental frequency
• MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA – Several measures of variation in amplitude
• NHR, HNR – Two measures of the ratio of noise to tonal components in the voice
• status – Health status of the subject: one = Parkinson's, zero = healthy
• RPDE, D2 – Two nonlinear dynamical complexity measures
• DFA – Signal fractal scaling exponent
• spread1, spread2, PPE – Three nonlinear measures of fundamental frequency variation
11. • Importing Necessary Libraries
• We first load the libraries required to build our model.

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
12. • Reading CSV Data into Pandas
• Next, we load the dataset from the CSV file into a pandas dataframe and check the top 5 rows.

df = pd.read_csv("Parkinsson disease.csv")  # filename kept as in the original slides
df.head()
13. • Data Preprocessing
• The "name" column is not going to add any value in training the model and can be discarded, so we drop it below.

df.drop(df.columns[0], axis=1, inplace=True)

• Next, we separate the feature matrix and the target vector as shown below.

# Independent and dependent features
X = df.drop('status', axis=1)
y = df['status']
14. Hold-Out Approach in Sklearn
• The hold-out approach can be applied by using the train_test_split function of sklearn.model_selection.
• In the below example we split the dataset to create a test set with 30% of the data and a train set with the remaining 70%. The random_state number ensures the split is deterministic on every run.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=4)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
result = model.score(X_test, y_test)
print(result)

Out[38]:
0.7796610169491526
15. K-Fold Cross-Validation
• K-fold cross-validation in Sklearn can be applied by using the cross_val_score function of sklearn.model_selection.
• In the below example, 10 folds are used, producing 10 accuracy scores from which we calculate the mean score.

In [40]:
from sklearn.model_selection import cross_val_score

model = DecisionTreeClassifier()
kfold_validation = KFold(10)
results = cross_val_score(model, X, y, cv=kfold_validation)
print(results)
print(np.mean(results))

Out[40]:
[0.7 0.8 0.8 0.8 0.8 0.78947368
 0.84210526 1. 0.68421053 0.36842105]
0.758421052631579
16. • Stratified K-Fold Cross-Validation
• In Sklearn, stratified k-fold cross-validation can be applied by using the StratifiedKFold class of sklearn.model_selection.
• In the below example, the dataset is divided into 5 splits or folds. It returns 5 accuracy scores, from which we calculate the final mean score.

from sklearn.model_selection import StratifiedKFold

skfold = StratifiedKFold(n_splits=5)
model = DecisionTreeClassifier()
scores = cross_val_score(model, X, y, cv=skfold)
print(scores)
print(np.mean(scores))

Out[41]:
array([0.61538462, 0.79487179, 0.71794872, 0.74358974, 0.71794872])
0.717948717948718