The document discusses an agenda for a lecture on deriving knowledge from data at scale. The lecture will include a course project check-in, a thought exercise on data transformation, and a deeper dive into ensembling techniques. It also provides tips on building experience and intuition for data science: becoming proficient in tools, deeply understanding algorithms, and focusing on specific data types through hands-on experimentation. Attribute selection techniques such as filters, wrappers and embedded methods are also covered. Finally, the document discusses support vector machines and handling missing values in data.
Machine Learning has become a must to improve insight, quality and time to market. But it's also been called the 'high interest credit card of technical debt' with challenges in managing both how it's applied and how its results are consumed.
H2O World - Top 10 Data Science Pitfalls - Mark Landry (Sri Ambati)
H2O World 2015 - Mark Landry
Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
If you are curious what ML is all about, this is a gentle introduction to Machine Learning and Deep Learning. It covers questions such as: why ML/Data Analytics/Deep Learning? It gives an intuitive understanding of how they work, and looks at some models in detail. Finally, I share some useful resources to get started.
Heuristic design of experiments w meta gradient search (Greg Makowski)
Once you have started learning about predictive algorithms, and the basic knowledge discovery in databases process, what is the next level of detail to learn for a consulting project?
* Give examples of the many model training parameters
* Track results in a "model notebook"
* Use a model metric that combines both accuracy and generalization to rank models
* How to strategically search over the model training parameters - use a gradient descent approach
* One way to describe an arbitrarily complex predictive system is by using sensitivity analysis
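A minimal sketch of the kind of strategic parameter search described above, with a toy metric standing in for a real model score that combines accuracy and generalization (the metric function, grid, and parameter names below are all illustrative assumptions):

```python
def combined_metric(params):
    # Toy stand-in for a model metric blending accuracy and generalization
    # (e.g., validation accuracy minus the train/validation gap).
    # Its peak is at learning_rate=0.1, depth=6.
    lr, depth = params["learning_rate"], params["depth"]
    return 1.0 - abs(lr - 0.1) * 2 - abs(depth - 6) * 0.05

def coordinate_search(grid, metric, start):
    """Greedy coordinate search over a discrete hyperparameter grid:
    vary one parameter at a time, keep changes that raise the metric."""
    best = dict(start)
    best_score = metric(best)
    notebook = []  # a simple "model notebook": every configuration tried
    improved = True
    while improved:
        improved = False
        for name, values in grid.items():
            for v in values:
                trial = dict(best, **{name: v})
                score = metric(trial)
                notebook.append((trial, score))
                if score > best_score:
                    best, best_score, improved = trial, score, True
    return best, best_score, notebook

grid = {"learning_rate": [0.01, 0.05, 0.1, 0.3], "depth": [2, 4, 6, 8]}
best, score, notebook = coordinate_search(
    grid, combined_metric, {"learning_rate": 0.01, "depth": 2})
```

The search climbs the metric one parameter at a time rather than trying every combination, which mirrors the gradient-descent-style approach the talk recommends.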
Machine Learning and Real-World Applications (MachinePulse)
This presentation was created by Ajay, Machine Learning Scientist at MachinePulse, to present at a Meetup on Jan. 30, 2015. These slides provide an overview of widely used machine learning algorithms. The slides conclude with examples of real world applications.
Ajay Ramaseshan is a Machine Learning Scientist at MachinePulse. He holds a Bachelor's degree in Computer Science from NITK Surathkal and a Master's in Machine Learning and Data Mining from Aalto University School of Science, Finland. He has extensive experience in the machine learning domain and has dealt with various real-world problems.
Fairly Measuring Fairness In Machine Learning (HJ van Veen)
We look at a case and two research papers on measuring discrimination in machine learning models for extending credit. Presentation given as part of the Sao Paulo Machine Learning Meetup, theme "Ethics in Data Science".
Module 1: introduction to machine learning (Sara Hooker)
We believe in building technical capacity all over the world.
We are building and teaching an accessible introduction to machine learning for students passionate about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our work, visit www.deltanalytics.org
Moving Your Machine Learning Models to Production with TensorFlow Extended (Jonathan Mugan)
ML is great fun, but now we want it to solve real problems. To do this, we need a way of keeping track of all of our data and models, and we need to know when our models fail and why. This talk will cover how to move ML to production with TensorFlow Extended (TFX). TFX is used by Google internally for machine-learning model development and deployment, and it has recently been made public. TFX consists of multiple pipeline elements and associated components, and this talk will cover them all, but three elements are particularly interesting: TensorFlow Data Validation, TensorFlow Model Analysis, and the What-If Tool.
The TensorFlow Data Validation library analyses incoming data and computes distributions over the feature values. This can show us which features may not be useful, maybe because they always have the same value, or which features may contain bugs. TensorFlow Model Analysis allows us to understand how well our model performs on different slices of the data. For example, we may find that our predictive models are more accurate for events that happen on Tuesdays, and such knowledge can be used to help us better understand our data and our business. The What-If Tool is an interactive tool that allows you to change data and see what the model would say if a particular record had a particular feature value. It lets you probe your model, and it can automatically find the closest record with a different predicted label, which allows you to learn what the model is homing in on. Machine learning is growing up.
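The idea of computing per-feature value distributions to surface useless or buggy features can be sketched in a few lines of plain Python. This is a toy illustration of the concept only, not the TensorFlow Data Validation API:

```python
from collections import Counter

def validate_features(rows):
    """Build a value distribution per feature and flag suspicious ones:
    constant features (always the same value) and features with missing
    values. A toy analogue of what data-validation tooling automates."""
    features = {}
    for row in rows:
        for name, value in row.items():
            features.setdefault(name, Counter())[value] += 1
    report = {}
    for name, dist in features.items():
        issues = []
        if len(dist) == 1:
            issues.append("constant")        # carries no signal
        if None in dist:
            issues.append("missing values")  # may indicate a pipeline bug
        report[name] = {"distribution": dict(dist), "issues": issues}
    return report

rows = [
    {"country": "US", "version": "1.0", "clicks": 3},
    {"country": "DE", "version": "1.0", "clicks": None},
    {"country": "US", "version": "1.0", "clicks": 7},
]
report = validate_features(rows)
```

Here "version" would be flagged as constant and "clicks" as containing missing values, exactly the kind of finding the paragraph above describes.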
Provides a brief overview of what machine learning is, how it works (theory), how to prepare data for a machine learning problem, an example case study, and additional resources.
H2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel (Sri Ambati)
H2O World 2015 - Arno Candel
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc... (Edureka!)
This Edureka Decision Tree tutorial will help you understand all the basics of decision trees. This decision tree tutorial is ideal both for beginners and for professionals who want to learn or brush up on their Data Science concepts and learn decision tree analysis along with examples.
Below are the topics covered in this tutorial:
1) Machine Learning Introduction
2) Classification
3) Types of classifiers
4) Decision tree
5) How does Decision tree work?
6) Demo in R
You can also take a complete structured training, check out the details here: https://goo.gl/AfxwBc
Two-hour lecture I gave at the Jyväskylä Summer School. The purpose of the talk is to give a quick, non-technical overview of concepts and methodologies in data science. Topics include a wide overview of both pattern mining and machine learning.
See also Part 2 of the lecture: Industrial Data Science. You can find it in my profile (click the face)
Data Science Training | Data Science For Beginners | Data Science With Python... (Simplilearn)
This Data Science presentation will help you understand what Data Science is, who a Data Scientist is, what a Data Scientist does, and how Python is used for Data Science. Data science is an interdisciplinary field of scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining. This Data Science tutorial will help you establish your skills at analytical techniques using Python. With this Data Science video, you’ll learn the essential concepts of Data Science with Python programming and also understand how data acquisition, data preparation, data mining, model building & testing, and data visualization are done. This Data Science tutorial is ideal for beginners who aspire to become a Data Scientist.
This Data Science presentation will cover the following topics:
1. What is Data Science?
2. Who is a Data Scientist?
3. What does a Data Scientist do?
This Data Science with Python course will establish your mastery of data science and analytics techniques using Python. With this Python for Data Science Course, you’ll learn the essential concepts of Python programming and become an expert in data analytics, machine learning, data visualization, web scraping and natural language processing. Python is a required skill for many data science positions, so jumpstart your career with this interactive, hands-on course.
Why learn Data Science?
Data Scientists are being deployed in all kinds of industries, creating a huge demand for skilled professionals. A data scientist is the pinnacle rank in an analytics organization. Glassdoor has ranked data scientist first in the 25 Best Jobs for 2016, and good data scientists are scarce and in great demand. As a data scientist you will be required to understand the business problem, design the analysis, collect and format the required data, apply algorithms or techniques using the correct tools, and finally make recommendations backed by data.
You can gain in-depth knowledge of Data Science by taking our Data Science with python certification training course. With Simplilearn’s Data Science certification training course, you will prepare for a career as a Data Scientist as you master all the concepts and techniques. Those who complete the course will be able to:
1. Gain an in-depth understanding of data science processes, data wrangling, data exploration, data visualization, hypothesis building, and testing. You will also learn the basics of statistics.
2. Install the required Python environment and other auxiliary tools and libraries.
3. Understand the essential concepts of Python programming such as data types, tuples, lists, dicts, basic operators and functions.
4. Perform high-level mathematical computing using the NumPy package and its large library of mathematical functions.
Learn more at: https://www.simplilearn.com
Data Science, Machine Learning and Neural NetworksBICA Labs
Lecture briefly overviewing state of the art of Data Science, Machine Learning and Neural Networks. Covers main Artificial Intelligence technologies, Data Science algorithms, Neural network architectures and cloud computing facilities enabling the whole stack.
Documentation management system & information databases (Syed Zaid Irshad)
A document management system (DMS) is a computer system (or set of computer programs) used to track and store electronic documents and/or images of paper documents.
This presentation includes a step-by-step tutorial, with screen recordings, for learning RapidMiner. It also includes the step-by-step procedure for using its most interesting features: Turbo Prep and Auto Model.
An introduction to variable and feature selection (Marco Meoni)
Presentation of a great paper from Isabelle Guyon (Clopinet) and André Elisseeff (Max Planck Institute) back in 2003, which outlines the main techniques for feature selection and model validation in machine learning systems
Sample Codes: https://github.com/davegautam/dotnetconfsamplecodes
Presentation on How you can get started with ML.NET. If you are existing .NET Stack Developer and Wanna use the same technology into Machine Learning, this slide focuses on how you can use ML.NET for Machine Learning.
Weka is a data mining/machine learning tool developed by the Department of Computer Science, University of Waikato, New Zealand. It is a collection of machine learning algorithms for data mining tasks, and it is open source software issued under the GNU General Public License.
Scikit-Learn is a powerful machine learning library implemented in Python with the numeric and scientific computing powerhouses NumPy, SciPy, and matplotlib for extremely fast analysis of small to medium-sized data sets. It is open source, commercially usable and contains many modern machine learning algorithms for classification, regression, clustering, feature extraction, and optimization. For this reason Scikit-Learn is often the first tool in a Data Scientist's toolkit for machine learning on incoming data sets.
The purpose of this one day course is to serve as an introduction to Machine Learning with Scikit-Learn. We will explore several clustering, classification, and regression algorithms for a variety of machine learning tasks and learn how to implement these tasks with our data using Scikit-Learn and Python. In particular, we will structure our machine learning models as though we were producing a data product, an actionable model that can be used in larger programs or algorithms; rather than as simply a research or investigation methodology.
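A minimal end-to-end example of the workflow the course teaches: split the data, fit an estimator, and score it. The dataset and estimator here are illustrative choices, not the course's specific material:

```python
# Split, fit, predict, score: the core Scikit-Learn loop.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                 # learn from the training split
acc = accuracy_score(y_test, model.predict(X_test))
```

The same `fit`/`predict` interface applies across Scikit-Learn's classifiers, regressors, and clusterers, which is what makes models easy to embed in larger data products.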
Optimization Technique for Feature Selection and Classification Using Support... (IJTET Journal)
Abstract— Classification problems often have a large number of features in the data sets, but only some of them are useful for classification. Irrelevant and redundant features reduce data mining performance. Feature selection aims to choose a small number of relevant features to achieve similar or even better classification performance than using all features. It has two main objectives: maximizing the classification performance and minimizing the number of features. Moreover, the existing feature selection algorithms treat the task as a single-objective problem. Attribute selection is done by a combination of attribute evaluator and search method using the WEKA machine learning tool. We then use the SVM classification algorithm to automatically classify the data using the selected features, comparing results across different standard datasets.
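The combination the abstract describes, filter-style feature selection followed by SVM classification, can be sketched as a Scikit-Learn pipeline. This is an analogue of the idea, not the paper's WEKA setup; the dataset and the choice of k are illustrative:

```python
# Filter feature selection (ANOVA F-score) feeding an SVM classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # 30 features, 2 classes

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),  # keep the 10 top-scored features
    ("svm", SVC(kernel="rbf")),
])
scores = cross_val_score(pipe, X, y, cv=5)
mean_acc = scores.mean()
```

Wrapping selection and classification in one pipeline keeps the feature scoring inside each cross-validation fold, so the reported accuracy is not inflated by selecting features on the test data.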
This is an introductory workshop for machine learning. Introduced machine learning tasks such as supervised learning, unsupervised learning and reinforcement learning.
In machine learning, model selection is a bit more nuanced than simply picking the 'right' or 'wrong' algorithm. In practice, the workflow includes (1) selecting and/or engineering the smallest and most predictive feature set, (2) choosing a set of algorithms from a model family, and (3) tuning the algorithm hyperparameters to optimize performance. Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can more effectively hone in on quality models than exhaustive search.
This talk presents a new open source Python library, Yellowbrick, which extends the Scikit-Learn API with a visual transformer (visualizer) that can incorporate visualizations of the model selection process into pipelines and modeling workflows. Visualizers enable machine learning practitioners to visually interpret the model selection process, steer workflows toward more predictive models, and avoid common pitfalls and traps. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.
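The automated baseline this talk contrasts with human-guided selection is exhaustive grid search over hyperparameters. A minimal Scikit-Learn version, with an illustrative estimator and grid:

```python
# Exhaustive search over a small hyperparameter grid with cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}

search = GridSearchCV(SVC(), param_grid, cv=5)  # tries all 9 combinations
search.fit(X, y)
best_params, best_score = search.best_params_, search.best_score_
```

Grid search scales poorly as parameters multiply, which is exactly the gap where visual tools and human intuition can hone in on quality models faster than exhaustive enumeration.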
PREDICT THE FUTURE, MACHINE LEARNING & BIG DATA (DotNetCampus)
Learn how to use Azure Machine Learning, a cloud service that lets companies, universities, research centers and developers incorporate and exploit machine learning and predictive analytics capabilities on huge data sets in their applications. Through Azure ML Studio we can create, test, deploy and manage predictive analytics and machine learning solutions in the cloud from any web browser. The session gives a taste of this through an example of predictive analysis on Flight Delay.
Talk presented at Strata'18 on unsupervised machine learning algorithms that operate on streams of data, continuously evolving as data streams through the system.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation. The notes cover the following experiments:
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
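The storage-type experiment (float vs bfloat16 for the element sum) can be illustrated directly. NumPy has no bfloat16, so float16 stands in here as an assumption to show how reduced-precision storage shifts the result of a large reduction even when the accumulation itself is done in float64:

```python
import numpy as np

n = 10_000
full = np.full(n, 0.1, dtype=np.float64)   # full-precision storage
half = np.full(n, 0.1, dtype=np.float16)   # reduced-precision storage

# Accumulate both in float64, so only the *storage* type differs.
sum_full = full.sum(dtype=np.float64)
sum_half = half.astype(np.float64).sum()

# Each stored element was rounded (float16 cannot represent 0.1 exactly),
# and those per-element rounding errors add up across the reduction.
error = abs(sum_full - sum_half)
```

With 10,000 elements the reduced-storage sum drifts by roughly 0.24 from the expected 1000, which is the trade-off the report measures against the memory and bandwidth savings of the smaller type.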
7. Deriving Knowledge from Data at Scale
Three Steps (every 3 – 4 months):
1. Become proficient in using one tool;
2. Select one algorithm for a deep dive;
3. Focus on one data type;
Hands-on practice…
53. Deriving Knowledge from Data at Scale
What is Evaluated?         Evaluation Method
Attributes                 Independent         -> Filters
Subsets of attributes      Independent         -> Filters
Subsets of attributes      Learning algorithm  -> Wrappers
57. Deriving Knowledge from Data at Scale
Attribute Evaluator: interface for classes that evaluate attributes…
Search Method: interface for ranking or searching for a subset of attributes…
58. Deriving Knowledge from Data at Scale
Select CorrelationAttributeEval for Pearson correlation…
Set to False, it does not return the R score; set to True, it returns R scores.
59. Deriving Knowledge from Data at Scale
Ranks attributes by their individual evaluations; used in
conjunction with GainRatio, Entropy, Pearson, etc…
• Number of attributes to return; -1 returns all ranked attributes
• Attributes to ignore (skip) in the evaluation, format: [1, 3-5, 10]
• Cutoff below which attributes are discarded; -1 means no cutoff
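The filter idea behind CorrelationAttributeEval plus the Ranker can be sketched in a few lines of plain Python. Names and the toy data below are illustrative, not WEKA's API:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient (assumes non-constant columns)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_attributes(columns, target, num_to_select=-1):
    """Score each attribute by |r| with the class and rank; like the
    Ranker search method, num_to_select=-1 returns all attributes."""
    scores = [(name, abs(pearson_r(col, target)))
              for name, col in columns.items()]
    scores.sort(key=lambda t: t[1], reverse=True)
    return scores if num_to_select < 0 else scores[:num_to_select]

# Toy data: "a" and "b" track the class perfectly, "c" only partly.
ranked = rank_attributes(
    {"a": [1, 2, 3, 4], "b": [4, 3, 2, 1], "c": [1, 1, 2, 2]},
    target=[1.0, 2.0, 3.0, 4.0],
)
```

Note that the filter never consults a learning algorithm: each attribute is scored independently against the class.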
66. Deriving Knowledge from Data at Scale
CfsSubsetEval
• True: adds features that are correlated with the class and NOT
inter-correlated with features already in the selection.
False: eliminates redundant features.
• Precompute the correlation matrix in advance (useful for fast
backtracking) or compute it lazily; when given a large number of
attributes, compute lazily…
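CfsSubsetEval scores a candidate subset by Hall's CFS "merit", which rewards feature–class correlation and penalizes inter-correlation among the chosen features. A sketch of the formula (function name and example values are illustrative):

```python
import math

def cfs_merit(k, avg_feature_class_corr, avg_feature_feature_corr):
    """CFS merit of a k-feature subset (Hall, 1999): high average
    feature-class correlation is rewarded; high average
    feature-feature inter-correlation is penalized."""
    return (k * avg_feature_class_corr
            / math.sqrt(k + k * (k - 1) * avg_feature_feature_corr))

# Two features, both well correlated with the class (0.8 on average):
independent = cfs_merit(2, 0.8, 0.0)   # not inter-correlated
redundant = cfs_merit(2, 0.8, 0.9)     # highly inter-correlated
```

Adding a redundant feature lowers the merit even though its class correlation is just as high, which is exactly the behavior described above.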
68. Deriving Knowledge from Data at Scale
1. Select a subset of attributes
2. Induce the learning algorithm on this subset
3. Evaluate the resulting model (e.g., accuracy)
4. Stop? If yes, done; if no, return to step 1
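The loop above can be sketched as a greedy forward wrapper. The `cv_score` callable stands in for "induce the learner on the subset and estimate accuracy by cross-validation"; the toy scoring function is invented for illustration:

```python
def wrapper_forward_select(attributes, cv_score, min_gain=1e-6):
    """Greedy forward wrapper: repeatedly add the attribute whose
    inclusion most improves the model's estimated score; stop when
    no single addition helps (the 'Stop?' test in the loop above)."""
    selected, best = [], cv_score(())
    while True:
        candidates = [(cv_score(tuple(selected + [a])), a)
                      for a in attributes if a not in selected]
        if not candidates:
            break
        score, attr = max(candidates)
        if score <= best + min_gain:
            break            # Stop? Yes: nothing improves the model
        selected.append(attr)
        best = score
    return selected, best

# Toy stand-in for 'induce learner + cross-validate': "a" and "c"
# carry signal, "b" does not, and every extra attribute costs a little.
signal = {"a": 0.3, "b": 0.0, "c": 0.2}
toy_cv_score = lambda subset: (0.5 + sum(signal[x] for x in subset)
                               - 0.05 * len(subset))

chosen, final_score = wrapper_forward_select(["a", "b", "c"], toy_cv_score)
```

Unlike a filter, every candidate subset here is scored by actually (conceptually) training and evaluating the model, which is why wrappers are far more expensive.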
72. Deriving Knowledge from Data at Scale
WrapperSubsetEval
73. Deriving Knowledge from Data at Scale
• Select and configure the ML algorithm…
• Evaluation measure: accuracy (default for discrete classes), RMSE
(default for numeric), AUC, AUPRC, F-measure (discrete classes)
• Number of folds used to estimate subset accuracy
74. Deriving Knowledge from Data at Scale
Search Method
BestFirst: the default search method. It searches the space of
descriptor subsets by greedy hill-climbing augmented with a
backtracking facility. BestFirst may start with the empty set of
descriptors and search forward (the default behavior), start with
the full set of attributes and search backward, or start at any
point and search in both directions (considering all
single-descriptor additions and deletions at a given point).
Other options include:
• GreedyStepwise;
• EvolutionarySearch;
• ExhaustiveSearch;
• LinearForwardSearch;
• GeneticSearch (could take hours)
94. Deriving Knowledge from Data at Scale
SMO and its complexity parameter ("-C")
• load your dataset in the Explorer
• choose weka.classifiers.meta.CVParameterSelection as classifier
• select weka.classifiers.functions.SMO as base classifier within CVParameterSelection and modify its
setup if necessary, e.g., RBF kernel
• open the ArrayEditor for CVParameters and enter the following string (and click on Add):
C 2 8 4
This will test the complexity parameters 2, 4, 6 and 8 (= 4 steps)
• close dialogs and start the classifier
• you will get output similar to this one, with the best parameters found in bold:
95. Deriving Knowledge from Data at Scale
LibSVM
• load your dataset in the Explorer
• choose weka.classifiers.meta.CVParameterSelection as classifier
• select weka.classifiers.functions.LibSVM as base classifier within CVParameterSelection and modify
its setup if necessary, e.g., RBF kernel
• open the ArrayEditor for CVParameters and enter the following string (and click on Add):
G 0.01 0.1 10
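How CVParameterSelection expands a parameter string can be sketched as follows: "C 2 8 4" gives 2, 4, 6, 8 (as stated above), and "G 0.01 0.1 10" gives 0.01 … 0.1 in 10 steps. The helper below is an illustration of that reading, not WEKA code:

```python
def cv_parameter_values(spec):
    """Expand a CVParameterSelection string 'NAME MIN MAX STEPS'
    into the list of parameter values actually tested."""
    name, lo, hi, steps = spec.split()
    lo, hi, steps = float(lo), float(hi), int(steps)
    step = (hi - lo) / (steps - 1) if steps > 1 else 0.0
    return name, [lo + i * step for i in range(steps)]

c_name, c_values = cv_parameter_values("C 2 8 4")        # 2, 4, 6, 8
g_name, g_values = cv_parameter_values("G 0.01 0.1 10")  # 0.01 ... 0.1
```

CVParameterSelection then trains the base classifier once per value, scores each by cross-validation, and keeps the best.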
96. Deriving Knowledge from Data at Scale
GridSearch
weka.classifiers.meta.GridSearch is a meta-classifier for exploring 2 parameters, hence the grid in the name.
Instead of just a classifier, one can specify a base classifier and a filter, both of which can be
optimized (one parameter each).
For each of the two axes, X and Y, one can specify the following parameters:
• min, the minimum value to start from.
• max, the maximum value.
• step, the step size used to get from min to max.
GridSearch can also be optimized based on the following measures:
• Correlation coefficient (= CC)
• Root mean squared error (= RMSE)
• Root relative squared error (= RRSE)
• Mean absolute error (= MAE)
• Relative absolute error (= RAE)
• Combined: (1-abs(CC)) + RRSE + RAE
• Accuracy (= ACC)
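The core of GridSearch — enumerate every (x, y) pair on the two axes and keep the best-scoring combination — can be sketched as below. The `evaluate` callable stands in for training and scoring the classifier + filter pair; here lower is better, as for RMSE:

```python
def axis_values(lo, hi, step):
    """Values along one GridSearch axis: min, min+step, ..., max."""
    vals, v = [], lo
    while v <= hi + 1e-9:
        vals.append(v)
        v += step
    return vals

def grid_search(x_axis, y_axis, evaluate):
    """Evaluate every (x, y) pair on the grid and return the best
    (score, x, y) triple; lower score is better, as for RMSE."""
    return min((evaluate(x, y), x, y)
               for x in axis_values(*x_axis)
               for y in axis_values(*y_axis))

# Toy objective with its minimum at x=2, y=0.5:
best = grid_search((0, 4, 1), (0.0, 1.0, 0.25),
                   lambda x, y: (x - 2) ** 2 + (y - 0.5) ** 2)
```

The grid has len(x) × len(y) cells, so coarse axes plus a second, refined pass around the winner is the usual practical strategy.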
99. Deriving Knowledge from Data at Scale
Missing values – UCI machine learning repository, 31 of 68 data sets
reported to have missing values. “Missing” can mean many things…
MAR: "Missing at Random":
– usually best case
– usually not true
Non-randomly missing
Presumed normal, so not measured
Causally missing
– attribute value is missing because of other attribute values (or because of
the outcome value!)
110. Deriving Knowledge from Data at Scale
The general ensemble procedure (diagram):
Step 1: Create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D.
Step 2: Build multiple classifiers C1, C2, …, Ct-1, Ct, one per data set.
Step 3: Combine the classifiers into a single classifier C*.
111. Deriving Knowledge from Data at Scale
Why does it work?
Suppose there are 25 base classifiers, each with error rate ε = 0.35.
If the classifiers err independently, the majority vote is wrong only
when at least 13 of the 25 err:

  ε_ensemble = Σ (i = 13 … 25) C(25, i) · ε^i · (1 − ε)^(25−i) ≈ 0.06
112. Deriving Knowledge from Data at Scale
Ensemble vs. Base Classifier Error
As long as base classifier is better than random (error < 0.5),
ensemble will be superior to base classifier
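The 0.06 figure can be checked directly, assuming 25 independent base classifiers with ε = 0.35 as above:

```python
from math import comb

def ensemble_error(n, eps):
    """Majority-vote error of n independent base classifiers, each
    with error rate eps: the ensemble errs only when more than half
    of the base classifiers err."""
    return sum(comb(n, i) * eps ** i * (1 - eps) ** (n - i)
               for i in range(n // 2 + 1, n + 1))

err_25 = ensemble_error(25, 0.35)   # the ~0.06 from the slide
```

At ε = 0.5 the formula gives exactly 0.5 — the crossover point of the statement above: only better-than-random base classifiers are improved by voting.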
113. Deriving Knowledge from Data at Scale
• Bagging
• Boosting
• DECORATE
These are meta-learners: each wraps a base learner.
114. Deriving Knowledge from Data at Scale
Diagram: the training set (compounds C1…Cn × descriptors D1…Dm) is
perturbed into several descriptor matrices (Matrix 1, Matrix 2,
Matrix 3); the learning algorithm is run on each, producing models
M1, M2, …, Me, which are combined into a consensus model (the ENSEMBLE).
118. Deriving Knowledge from Data at Scale
Bagging
Leo Breiman
(1928-2005)
Leo Breiman (1996). Bagging predictors. Machine Learning. 24(2):123-140.
Bagging = Bootstrap Aggregation
119. Deriving Knowledge from Data at Scale
Diagram: a bootstrap sample Si (e.g. C3, C2, C2, C4, C4) is drawn from
the training set S (compounds C1…Cn over descriptors D1…Dm).
Sample Si from training set S:
• All compounds have the same probability to be selected
• Each compound can be selected several times or even not selected at
all (i.e. compounds are sampled randomly with replacement)
Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall
120. Deriving Knowledge from Data at Scale
Bagging — bootstrap samples of the training data, by Data ID:
Original Data:      1  2  3  4  5  6  7  8  9  10
Bagging (Round 1):  7  8  10 8  2  5  10 10 5  9
Bagging (Round 2):  1  4  9  1  2  3  2  7  3  2
Bagging (Round 3):  1  8  5  10 5  5  9  6  3  7
121. Deriving Knowledge from Data at Scale
The 0.632 bootstrap
• A particular training instance has a probability of 1 − 1/n of not
being picked in one draw
• Thus its probability of ending up in the test data (never selected
in n draws) is:
  (1 − 1/n)^n ≈ e^(−1) ≈ 0.368
• This means the training data will contain approximately 63.2% of the
distinct instances
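The 0.368 limit is easy to verify numerically:

```python
import math

def prob_never_picked(n):
    """Chance that a given instance appears nowhere in a bootstrap
    sample of size n drawn with replacement: (1 - 1/n)^n -> 1/e."""
    return (1 - 1 / n) ** n

p_out = prob_never_picked(10_000)   # ~0.368, the out-of-bag fraction
p_in = 1 - p_out                    # ~0.632, the in-bag fraction
```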
122. Deriving Knowledge from Data at Scale
Bagging
Diagram: bootstrap samples S1, S2, …, Se are drawn from the training
set (compounds C1…Cn; in each perturbed set some compounds repeat and
others are left out); the learning algorithm is run on each sample,
producing models M1, M2, …, Me; the ENSEMBLE combines them into a
consensus model by voting (classification) or averaging (regression).
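A minimal bagging sketch, using a toy 1-D threshold-stump base learner. Everything here is illustrative, not WEKA's implementation:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """len(data) draws with replacement: repeats likely, some left out."""
    return [rng.choice(data) for _ in data]

def bagging(data, learn, n_models, seed=0):
    """Train one model per bootstrap sample; predict by majority vote."""
    rng = random.Random(seed)
    models = [learn(bootstrap_sample(data, rng)) for _ in range(n_models)]
    def predict(x):
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return predict

def stump(sample):
    """Toy base learner: threshold halfway between the class means."""
    ones = [x for x, y in sample if y == 1]
    zeros = [x for x, y in sample if y == 0]
    if not ones or not zeros:              # degenerate resample
        majority = 1 if len(ones) >= len(zeros) else 0
        return lambda x: majority
    t = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
    return lambda x: 1 if x >= t else 0

# x < 5 -> class 0, x >= 5 -> class 1
train = [(i, 1 if i >= 5 else 0) for i in range(10)]
vote = bagging(train, stump, n_models=11)
```

Each stump sees a slightly different resample and therefore picks a slightly different threshold; the vote averages that instability away.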
123. Deriving Knowledge from Data at Scale
Classification - Files
train-ache.sdf/test-ache.sdf
train-ache-t3ABl2u3.arff/test-ache-t3ABl2u3.arff
ache-t3ABl2u3.hdr
124. Deriving Knowledge from Data at Scale
Exercise 1
Development of one individual rules-based model
(JRip method in WEKA)
128. Deriving Knowledge from Data at Scale
187. (C*C),(C*C*C),(C*C-C),(C*N),(C*N*C),(C-C),(C-C-C),xC*
81. (C-N),(C-N-C),(C-N-C),(C-N-C),xC
12. (C*C),(C*C),(C*C*C),(C*C*C),(C*C*N),xC
129. Deriving Knowledge from Data at Scale
What happens if we randomize the data
and rebuild a JRip model ?
134. Deriving Knowledge from Data at Scale
Figure: ROC AUC of the consensus model as a function of the number of
bagging iterations (classification, AChE). Y-axis: ROC AUC, 0.74–0.88;
x-axis: number of bagging iterations, 0–10.
135. Deriving Knowledge from Data at Scale
Boosting trains a set of classifiers sequentially and combines them for
prediction, with each later classifier focusing on the mistakes of the
earlier classifiers.
Yoav Freund, Robert Schapire, Jerome Friedman
AdaBoost (classification): Yoav Freund, Robert E. Schapire: Experiments with a new boosting algorithm. In: Thirteenth International Conference on
Machine Learning, San Francisco, 148-156, 1996.
Regression boosting: J.H. Friedman (1999). Stochastic Gradient Boosting. Computational Statistics and Data Analysis. 38:367-378.
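A compact AdaBoost-style sketch on 1-D data with threshold stumps (labels ±1). It illustrates the reweighting idea described above; it is not Freund & Schapire's exact AdaBoost.M1 pseudocode:

```python
import math

def best_stump(xs, ys, w):
    """Threshold stump minimizing weighted error; labels are +/-1."""
    best = None
    for t in set(xs):
        for sign in (1, -1):
            preds = [sign if x >= t else -sign for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best
    return err, (lambda x, t=t, s=sign: s if x >= t else -s)

def adaboost(xs, ys, rounds=10):
    """Each round fits a stump to the weighted data, then up-weights
    the instances that stump got wrong, so the next stump focuses on
    the earlier mistakes; prediction is a weighted vote of stumps."""
    n = len(xs)
    w = [1.0 / n] * n
    stumps, alphas = [], []
    for _ in range(rounds):
        err, h = best_stump(xs, ys, w)
        err = max(err, 1e-10)
        if err >= 0.5:
            break                 # no better-than-random stump left
        alpha = 0.5 * math.log((1 - err) / err)
        stumps.append(h)
        alphas.append(alpha)
        w = [wi * math.exp(-alpha * y * h(x))
             for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda x: (1 if sum(a * h(x) for a, h in zip(alphas, stumps)) >= 0
                      else -1)

# No single stump classifies this 'interval' pattern; boosted stumps can:
model = adaboost([0, 1, 2, 3, 4, 5], [1, 1, -1, -1, 1, 1], rounds=5)
```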
136. Deriving Knowledge from Data at Scale
Diagram: boosting starts from the training set (compounds C1…Cn, each
with weight w). The learning algorithm is fit to the weighted sample
S1, giving model M1; the weights of the compounds it misclassifies
(marked e) are increased, the next model is fit to the reweighted
sample S2, and so on up to Mb. The ENSEMBLE combines the models into a
consensus model by weighted averaging & thresholding.
137. Deriving Knowledge from Data at Scale
Load train-ache-t3ABl2u3.arff
In classification tab, load test-ache-t3ABl2u3.arff
138. Deriving Knowledge from Data at Scale
In classifier tab, choose meta classifier AdaBoostM1
Set up an ensemble of one JRip model
140. Deriving Knowledge from Data at Scale
Figure: ROC AUC as a function of the number of boosting iterations
(classification, AChE). Y-axis: ROC AUC, 0.76–0.83; x-axis:
log(number of boosting iterations).
141. Deriving Knowledge from Data at Scale
Figures: Bagging vs Boosting (y-axis 0.7–1.0) against the number of
iterations (log scale). Left: base learner DecisionStump, 1–1000
iterations; right: base learner JRip, 1–100 iterations.
142. Deriving Knowledge from Data at Scale
Conjecture: Bagging vs Boosting
Bagging leverages unstable base learners that
are weak because of overfitting (JRip, MLR)
Boosting leverages stable base learners that
are weak because of underfitting
(DecisionStump, SLR)
143. Deriving Knowledge from Data at Scale
Random Subspace Method
Tin Kam Ho
Tin Kam Ho (1998). The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on
Pattern Analysis and Machine Intelligence. 20(8):832-844.
144. Deriving Knowledge from Data at Scale
Diagram: a training set with the initial pool of descriptors D1, D2,
D3, D4, …, Dm (compounds C1…Cn) is reduced to a training set with
randomly selected descriptors (e.g. D3, D2, Dm, D4).
• All descriptors have the same probability to be selected
• Each descriptor can be selected only once
• Only a certain part of the descriptors is selected in each run
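Drawing the random descriptor subsets can be sketched as below (illustrative, not WEKA's RandomSubSpace code):

```python
import random

def random_subspaces(descriptors, fraction, n_sets, seed=0):
    """Each run keeps a random fraction of the descriptor pool,
    sampled WITHOUT replacement, so a descriptor appears at most
    once per run."""
    rng = random.Random(seed)
    k = max(1, int(len(descriptors) * fraction))
    return [rng.sample(descriptors, k) for _ in range(n_sets)]

pool = ["D%d" % i for i in range(1, 11)]   # D1 ... D10
subsets = random_subspaces(pool, fraction=0.5, n_sets=4)
```

One model is then trained per subset, exactly as in bagging, but with the perturbation applied to the columns rather than the rows.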
145. Deriving Knowledge from Data at Scale
Random Subspace Method
Diagram: data sets S1, S2, …, Se, each with randomly selected
descriptors from D1…Dm (e.g. D4 D2 D3; D1 D2 D3; D4 D2 D1), are fed to
the learning algorithm, producing models M1, M2, …, Me; the ENSEMBLE
combines them into a consensus model by voting (classification) or
averaging (regression).
146. Deriving Knowledge from Data at Scale
Load train-logs-t1ABl2u4.arff
In classification tab, load test-logs-t1ABl2u4.arff
148. Deriving Knowledge from Data at Scale
Base classifier: Multi-Linear Regression
without descriptor selection
Build an ensemble of 1 model
… then build an ensemble of 10 models.
151. Deriving Knowledge from Data at Scale
Random Forest
Leo Breiman
(1928-2005)
Leo Breiman (2001). Random Forests. Machine Learning. 45(1):5-32.
Random Forest = Bagging + Random Subspace (the base learner is a random tree)
152. Deriving Knowledge from Data at Scale
David H. Wolpert
Wolpert, D., Stacked Generalization., Neural Networks, 5(2), pp. 241-259., 1992
Breiman, L., Stacked Regression, Machine Learning, 24, 1996
153. Deriving Knowledge from Data at Scale
Diagram: stacking uses the SAME data set S (compounds C1…Cn ×
descriptors D1…Dm) for every base model. Different learning algorithms
L1, L2, …, Le are each trained on S, producing models M1, M2, …, Me;
a machine-learning meta-method (e.g. MLR) combines them into the
consensus model (the ENSEMBLE).
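A stacking sketch, with a deliberately crude meta-method: each base model is weighted by its inverse training MSE as a stand-in for fitting an MLR meta-model (proper stacking fits the meta-learner on out-of-fold base predictions):

```python
def stacking(data, base_learners):
    """All base learners are trained on the SAME data set S; a
    meta-level blender then combines their predictions. As a crude
    stand-in for an MLR meta-model, weight each base model by its
    inverse training MSE."""
    models = [learn(data) for learn in base_learners]
    def mse(m):
        return sum((m(x) - y) ** 2 for x, y in data) / len(data)
    inv = [1.0 / (mse(m) + 1e-9) for m in models]
    total = sum(inv)
    weights = [v / total for v in inv]
    return lambda x: sum(w * m(x) for w, m in zip(weights, models))

# Two toy base learners: a constant (mean) predictor and a
# least-squares line through the origin.
mean_learner = lambda d: (lambda x, c=sum(y for _, y in d) / len(d): c)
def line_learner(d):
    b = sum(x * y for x, y in d) / sum(x * x for x, _ in d)
    return lambda x: b * x

train = [(x, 2.0 * x) for x in range(1, 6)]   # y = 2x
consensus = stacking(train, [mean_learner, line_learner])
```

On this toy data the line fits almost perfectly, so the blender gives it nearly all the weight — the consensus tracks the better base model, which is the point of stacking.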
155. Deriving Knowledge from Data at Scale
• Delete the classifier ZeroR
• Add PLS classifier (default parameters)
• Add Regression Tree M5P (default parameters)
• Add Multi-Linear Regression without descriptor selection
156. Deriving Knowledge from Data at Scale
Select Multi-Linear Regression as the meta-method
158. Deriving Knowledge from Data at Scale
Exercise 5
Rebuild the stacked model using:
• kNN (default parameters)
• Multi-Linear Regression without descriptor selection
• PLS classifier (default parameters)
• Regression Tree M5P