This document provides an overview of machine learning, including definitions, types, popular use cases, algorithms, and programming languages/libraries. It defines machine learning as a field that gives computers the ability to learn without being explicitly programmed. It outlines supervised and unsupervised learning and describes a regression problem and its solution using gradient descent. It also discusses programming languages like R and Python, and libraries and services such as WEKA, Mahout, and the Google Prediction API.
Machine Learning - What, Where and How
4. Definition
Field of study that gives computers the ability to learn without being explicitly programmed
-- Arthur Samuel
A more mathematical one
A computer program is said to learn from experience E with respect to some task T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E
-- Tom M. Mitchell
5. Related Disciplines
Sub-field of Artificial Intelligence
Deals with design and development of algorithms
Closely related to Data Mining
Uses techniques from Statistics, Probability Theory and Pattern Recognition
Not new, but growing fast because of Big Data
6. Types of Machine Learning
Supervised Machine Learning
  Provide the right set of answers for a set of questions
  The underlying algorithm learns/infers over a period of time
  It then tries to return correct answers for similar questions
Unsupervised Machine Learning
  Provide data and let the underlying algorithm find some structure
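As a rough illustration of the two types above, the sketch below uses made-up one-dimensional data. The helper names `nearest_neighbor_predict` and `two_means` are hypothetical and the techniques (nearest neighbour, two-cluster k-means) are stand-ins chosen for brevity; the slides do not name specific algorithms here. `two_means` assumes the points contain at least two distinct values.

```python
def nearest_neighbor_predict(examples, x):
    """Supervised: examples are (question, answer) pairs.
    Answer a new question with the label of the closest known question."""
    closest = min(examples, key=lambda pair: abs(pair[0] - x))
    return closest[1]


def two_means(points, iterations=10):
    """Unsupervised: no answers are given.
    Split the points into two groups (1-D k-means with k = 2)."""
    lo, hi = min(points), max(points)  # initial group centres
    for _ in range(iterations):
        group_lo = [p for p in points if abs(p - lo) <= abs(p - hi)]
        group_hi = [p for p in points if abs(p - lo) > abs(p - hi)]
        lo = sum(group_lo) / len(group_lo)
        hi = sum(group_hi) / len(group_hi)
    return sorted(group_lo), sorted(group_hi)


# Made-up data for illustration only.
labeled = [(1.0, "small"), (1.2, "small"), (8.0, "large"), (9.5, "large")]
unlabeled = [1.0, 1.2, 0.8, 8.0, 9.5, 9.0]

print(nearest_neighbor_predict(labeled, 7.0))
print(two_means(unlabeled))
```

In the supervised case the labels drive the answer; in the unsupervised case the grouping emerges from the data alone.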
7. Popular Use Cases
Recommendation Systems
  Amazon, Netflix, iTunes Genius, IMDb...
Up-Selling & Churn Analysis
Customer Sentiment Analysis
Market Segmentation
...
10. Typical Machine Learning Algorithm
Training Set -> Learning Algorithm -> Hypothesis
Input Features -> Hypothesis -> Expected Output
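The flow on this slide can be sketched as a function that takes a training set and returns a hypothesis, which is itself a function from input features to an expected output. The learner used here (a least-squares line through the origin) is only a stand-in chosen to keep the example short, and the data is made up.

```python
def learning_algorithm(training_set):
    """Training Set -> Learning Algorithm -> Hypothesis.

    training_set is a list of (feature, expected_output) pairs.
    Stand-in learner: least-squares fit of y = theta * x.
    """
    sum_xy = sum(x * y for x, y in training_set)
    sum_xx = sum(x * x for x, _ in training_set)
    theta = sum_xy / sum_xx

    def hypothesis(x):
        # Input Features -> Hypothesis -> Expected Output
        return theta * x

    return hypothesis


training_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up data, y = 2x
h = learning_algorithm(training_set)
print(h(4.0))
```

Returning the hypothesis as a closure keeps the two stages of the diagram separate: learning happens once, prediction can then be called many times.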
11. Let's Simplify a bit
➢ Goal is to draw a straight line which covers our data set reasonably
➢ Our hypothesis can be $h_\theta(x) = \theta_0 + \theta_1 x$, such that $h_\theta(x) \simeq y$
[Figure: House Sizes vs Prices; x-axis: House Sizes (Sq Yards), y-axis: Prices (1000 USD)]
12. In Mathematical Terms
➢ Hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$
➢ Parameters $\theta_0, \theta_1$
➢ Cost Function $J(\theta_0, \theta_1)$
➢ We would like to minimize $J(\theta_0, \theta_1)$
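The transcript does not preserve the formula for the cost function. A common choice for this kind of linear hypothesis, assumed here and not confirmed by the slide, is the halved mean squared error $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2$ over $m$ training examples. A minimal sketch on made-up data:

```python
def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x"""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Assumed cost: J = 1/(2m) * sum((h_theta(x_i) - y_i)^2)."""
    m = len(xs)
    squared_errors = (
        (hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)
    )
    return sum(squared_errors) / (2 * m)


# Made-up data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

print(cost(1.0, 2.0, xs, ys))  # the true parameters give zero cost
print(cost(0.0, 0.0, xs, ys))  # a poor guess gives a larger cost
```

The cost is zero only when the line passes through every point, so lower values of J mean a better fit.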
13. Solution: Gradient Descent
➢ Start with initial values of $\theta_0, \theta_1$
➢ Keep changing $\theta_0, \theta_1$ until we end up at a minimum
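The two steps above can be sketched as follows. The slide does not say how the parameters are changed; this sketch assumes the usual batch update for a squared-error cost, $\theta_j \leftarrow \theta_j - \alpha \, \partial J / \partial \theta_j$, where the learning rate `alpha` and the fixed number of `steps` are illustrative assumptions, as is the data.

```python
def gradient_descent(xs, ys, alpha=0.05, steps=5000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent.

    Assumes the cost J = 1/(2m) * sum((h(x_i) - y_i)^2).
    """
    theta0, theta1 = 0.0, 0.0  # start with initial values
    m = len(xs)
    for _ in range(steps):  # keep changing theta0, theta1
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                              # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/dtheta1
        # Update both parameters simultaneously.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)
```

If `alpha` is too large the updates can overshoot and diverge; if it is too small, many more steps are needed. A fixed step count stands in here for a proper convergence check.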
22. What is WEKA?
Developed by Machine Learning Group, University of Waikato, New Zealand
Collection of Machine Learning Algorithms
Contains tools for
  Data Pre-Processing
  Classification & Regression
  Clustering
  Visualization
Can be embedded inside your application
Implemented in Java
24. Terminology
Training DataSet == Instances
Each Row in DataSet == Instance
Instance is a Collection of Attributes (Features)
Types of Attributes
  Nominal (True, False, Malignant, Benign, Cloudy...)
  Real values (6, 2.34, 0...)
  String (“Interesting”, “Really like it”, “Hate It”...)
  ...
25. Sample DataSets

@RELATION house
@ATTRIBUTE houseSize real
@ATTRIBUTE lotSize real
@ATTRIBUTE bedrooms real
@ATTRIBUTE granite real
@ATTRIBUTE bathroom real
@ATTRIBUTE sellingPrice real
@DATA
3529,9191,6,0,0,205000
3247,10061,5,1,1,224900
4032,10150,5,0,1,197900
2397,14156,4,1,0,189900
2200,9600,4,0,1,195000
3536,19994,6,1,1,325000
2983,9365,5,0,1,230000

@RELATION CPU
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
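To show how the header and data sections of such a file relate, here is a minimal reader written for illustration. `parse_arff` is a hypothetical helper, not part of WEKA, and it handles only the simple dense layout shown on this slide (no quoted values, sparse rows, or multi-line declarations). The sample text reuses the house header and its first two data rows.

```python
def parse_arff(text):
    """Return (relation, attributes, rows) from simple dense ARFF text."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        keyword = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif keyword.startswith("@data"):
            in_data = True
    return relation, attributes, rows


sample = """@RELATION house
@ATTRIBUTE houseSize real
@ATTRIBUTE lotSize real
@ATTRIBUTE bedrooms real
@ATTRIBUTE granite real
@ATTRIBUTE bathroom real
@ATTRIBUTE sellingPrice real
@DATA
3529,9191,6,0,0,205000
3247,10061,5,1,1,224900
"""

relation, attributes, rows = parse_arff(sample)

# Each row is an instance: a collection of attribute values.
records = [
    {
        name: float(value) if kind.lower() == "real" else value
        for (name, kind), value in zip(attributes, row)
    }
    for row in rows
]
print(relation, len(attributes), records[0]["sellingPrice"])
```

Nominal attributes such as `{sunny, overcast, rainy}` are left as strings by this sketch; only `real` attributes are converted to numbers.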
28. Apache Mahout
➢ Collection of Machine Learning Algorithms
➢ Map-Reduce Enabled (most cases)
➢ DataSources
  ➢ Database
  ➢ File-System
➢ Lucene Integration
➢ Very Active Community
➢ Apache License
29. WEKA vs Apache Mahout
WEKA
➢ Lot of Algorithms
➢ Tools for
  ➢ Modeling
  ➢ Comparison
  ➢ Data-Flow
➢ May need work for running on large data-sets
➢ License Issues
Apache Mahout
➢ Lesser number of Algorithms but growing
➢ Lack of tools for Modeling
➢ Ready by Design for Large Scale
➢ Vibrant Community
➢ Apache License
31. Google Prediction API 101
➢ Cloud Based Web Service for Machine Learning
➢ Exposed as REST API
➢ Does not require any Machine Learning knowledge
➢ Capabilities
  ➢ Categorical &
  ➢ Regression
37. Resources
➢ Online Machine Learning Course - Prof. Andrew Ng, Stanford University
➢ WEKA Wiki and API docs
➢ Apache Mahout Wiki
➢ IBM Developer Works Articles
➢ Google Prediction API Web Site
➢ Data Mining: Practical Machine Learning Tools & Techniques - Ian H. Witten, Eibe Frank, Mark Hall
➢ Machine Learning Forums