This project aims to determine the housing prices of California properties for new sellers and also for buyers to estimate the profitability of the deal using various regression models.
Below are the details of the models implemented and their performance scores:
Linear Regression: RMSE 68321.7051304
Decision Tree Regressor: RMSE 70269.5738668
Random Forest Regressor: RMSE 52909.1080535
Support Vector Regressor: RMSE 110914.791356
Random Forest Regressor with fine-tuned hyperparameters: RMSE 49261.2835608
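RMSE, the metric used to score the models above, is the square root of the mean squared difference between actual and predicted prices. A minimal sketch of how it can be computed (the function name and sample prices are illustrative, not the project's data):

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted prices."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Toy prices, not the project's data:
print(round(rmse([250000, 180000, 320000], [240000, 200000, 310000]), 2))  # → 14142.14
```

Lower RMSE means predictions closer to actual sale prices, which is why the tuned Random Forest scores best above.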
House Price Estimates Based on Machine Learning Algorithm - ijtsrd
Housing prices increase every year, necessitating a long-term housing price strategy. Predicting a home's price assists a developer in determining a home's purchase price, and a consumer in determining the best time to buy. The sale price of real estate in major cities depends on the specific circumstances. Housing prices change from day to day and are sometimes set arbitrarily rather than based on estimates. Predicting real estate prices from real factors is a key element of our analysis. We base our test on all of the simple metrics that are taken into account when deciding significance. In this research we use a linear regression pipeline, and the result is not the output of a single model but a weighted combination of various techniques, to give the most accurate results. There are fifteen features in the data collection. In this research, an effort has been made to build a forecasting model that determines the price based on the variables that influence it. The results show lower error and higher accuracy than when individual algorithms are used. Jakir Khan | Dr. Ganesh D, "House Price Estimates Based on Machine Learning Algorithm", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5, Issue-4, June 2021. URL: https://www.ijtsrd.com/papers/ijtsrd42367.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/42367/house-price-estimates-based-on-machine-learning-algorithm/jakir-khan
Predicting Moscow Real Estate Prices with Azure Machine Learning - Leo Salemann
With only three months' instruction, a five-person team uses Azure Machine Learning Studio to predict Moscow real estate prices based on property descriptors, macroeconomic indicators, and geospatial data.
House Price Prediction: An AI Approach - Nahian Ahmed
Suppose you have a house and you want to sell it. Through the House Price Prediction project you can predict its price from previous sale history.
We make this prediction using machine learning.
A presentation on predicting house prices. It also covers the basics of machine learning and the regression algorithm used to predict those prices.
ABSTRACT
House Price Index is commonly used to estimate changes in housing prices. Since housing price is strongly correlated with other factors such as location, area, and population, predicting an individual housing price requires information beyond the index itself. A considerable number of papers have adopted traditional machine learning approaches to predict housing prices accurately, but they rarely examine the performance of individual models and neglect the less popular yet complex ones. As a result, to explore the various impacts of features on prediction methods, this paper applies both traditional and advanced machine learning approaches to investigate the differences among several advanced models. This paper also comprehensively validates multiple techniques in regression model implementation and provides an optimistic result for housing price prediction.
INTRODUCTION
House price prediction is a great project for learning and applying machine learning algorithms. The basic idea behind this project is that we train the machine with a machine learning algorithm on the data set.
In this busy world it is very difficult to find a house that matches our needs and budget, and it becomes even more difficult in metropolitan cities like Mumbai, Kolkata, and Delhi. This project uses data for the city of Mumbai to train and test the machine so that it becomes capable of predicting house prices. A machine learning algorithm makes it easy to estimate the price of a house from its location, area, number of bedrooms, and other features.
In this project, the Random Forest Regression, Linear Regression, and Decision Tree machine learning algorithms have been used to compare their efficiency. Based on this comparison we determine which algorithm is best suited for predicting house prices in Mumbai.
CONCLUSION AND FUTURE SCOPE
The accuracy of the designed model depends on the dataset selected: the better the dataset, the better the accuracy. The best-suited model is Random Forest. The approach can be applied to the dataset of any city for house price prediction. The project can be enhanced with a UI through which users can predict prices in an easier, more interactive way. In this busy world it will be of immense use when searching for a house near one's workplace.
DATASET LINK
https://www.kaggle.com/
Prediction of house price using multiple regression - vinovk
- Constructed a mathematical model using Multiple Regression to estimate the Selling price of the house based on a set of predictor variables.
- SAS was used for Variable profiling, data transformations, data preparation, regression modeling, fitting data, model diagnostics, and outlier detection.
The main objective of this paper is to recognize and predict handwritten digits from 0 to 9, where a data set of 5000 MNIST examples was given as input. Since every person has a different style of writing digits, humans can recognize them easily, but for computers it is a comparatively difficult task; here a neural network approach is used, in which the machine learns on its own by gaining experience, and accuracy increases with the experience it gains. The dataset was trained using a feed-forward neural network algorithm. The overall system accuracy obtained was 95.7%. Jyoti Shinde | Chaitali Rajput | Prof. Mrunal Shidore | Prof. Milind Rane, "Handwritten Digit Recognition", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-2, Issue-2, February 2018. URL: http://www.ijtsrd.com/papers/ijtsrd8384.pdf http://www.ijtsrd.com/engineering/electronics-and-communication-engineering/8384/handwritten-digit-recognition/jyoti-shinde
Prediction of Diamond Prices Using Multivariate Regression - MohitMhapuskar
The prices of precious diamonds are primarily determined by a combination of the four Cs: Carat, Color, Cut, and Clarity. Our team used SAS to implement feature selection and multivariate regression to create a regression model that allows us to predict diamond prices from those intrinsic characteristics. Our model achieved an accuracy of 94%.
Activation Functions and Training Algorithms for Deep Neural Networks - Gayatri Khanvilkar
Training a deep neural network is a difficult task. Deep neural networks are trained with the help of training algorithms and activation functions. This is an overview of the activation functions and training algorithms used for deep neural networks, and it includes a brief comparative study of both.
HEALTH PREDICTION ANALYSIS USING DATA MINING - Ashish Salve
Data mining techniques are used for a variety of applications. In the healthcare industry, data mining plays an important role in predicting diseases. Detecting a disease normally requires a number of tests on the patient, but with data mining techniques the number of tests can be reduced, which saves time and improves performance. This report analyses data mining techniques that can be used for predicting different types of diseases, and reviews research papers that mainly concentrate on predicting various diseases.
.NET Fest 2017. Igor Kochetov. Classification of performance testing results... - NETFest
In this talk we discuss basic algorithms and application areas of Machine Learning (ML), then walk through a practical example of building a system that classifies performance measurement results collected in Unity via the internal Performance Test Framework, in order to find performance regressions or unstable tests. We also try to work out criteria for evaluating the performance of ML algorithms, and ways to debug them.
CSSC × GDSC: Intro to Machine Learning!
Aaron Shah and Manav Bhojak on October 5, 2023
🤖 Join us for an exciting ML Workshop! 🚀 Dive into the world of Machine Learning, where we'll unravel the mysteries of CNNs, RNNs, Transformers, and more. 🤯
Get ready to embark on a journey of discovery! We'll begin with an easy-to-follow introduction to the fascinating realm of ML. 📚
🛠️ In our hands-on session, we'll walk you through setting up your environment. No tech hurdles here! 🌐
🔍 Then, we'll get down to the nitty-gritty, guiding you through our starter code for a thrilling hands-on example. Together, we'll explore the power of ML in action! 💡
https://github.com/telecombcn-dl/dlmm-2017-dcu
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
Slides covered during Analytics Boot Camp conducted with the help of IBM, Venturesity. Special credits to Kumar Rishabh (Google) and Srinivas Nv Gannavarapu (IBM)
How to Win Machine Learning Competitions? - HackerEarth
This presentation was given by Marios Michailidis (a.k.a. Kazanova), current Kaggle rank #3, to help the community learn machine learning better. It comprises useful ML tips and techniques for performing better in machine learning competitions. Read the full blog: http://blog.hackerearth.com/winning-tips-machine-learning-competitions-kazanova-current-kaggle-3
Production-Ready BIG ML Workflows - from zero to hero - Daniel Marcous
Data science isn't an easy task to pull off.
You start with exploring data and experimenting with models.
Finally, you find some amazing insight!
What now?
How do you transform a little experiment to a production ready workflow? Better yet, how do you scale it from a small sample in R/Python to TBs of production data?
Building a BIG ML Workflow - from zero to hero, is about the work process you need to take in order to have a production ready workflow up and running.
Covering :
* Small - Medium experimentation (R)
* Big data implementation (Spark Mllib /+ pipeline)
* Setting Metrics and checks in place
* Ad hoc querying and exploring your results (Zeppelin)
* Pain points & Lessons learned the hard way (is there any other way?)
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.
V2.0 OpenPOWER AI Virtual University - deep learning and AI introduction - Ganesan Narayanasamy
OpenPOWER AI Virtual University focuses on bringing together industry, government, and academic expertise to connect and help shape the AI future.
https://www.youtube.com/channel/UCYLtbUp0AH0ZAv5mNut1Kcg
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Notes on adjusting primitives for graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices, which share the same in-links, reduces duplicate computations and thus iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be calculated easily; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
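As a baseline for the optimizations discussed above, a minimal sketch of standard (monolithic) power-iteration PageRank, in which every vertex is processed each iteration; the function name, graph, and parameters are illustrative:

```python
def pagerank(graph, d=0.85, tol=1e-10, max_iter=100):
    """graph: dict mapping each vertex to its list of out-neighbours.
    Dangling vertices spread their rank uniformly over all vertices."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(max_iter):
        dangling = sum(rank[v] for v in graph if not graph[v])
        # Teleport term plus the dangling vertices' redistributed rank:
        new = {v: (1 - d) / n + d * dangling / n for v in graph}
        for v, out in graph.items():
            for w in out:
                new[w] += d * rank[v] / len(out)
        if sum(abs(new[v] - rank[v]) for v in graph) < tol:
            rank = new
            break
        rank = new
    return rank

ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

The convergence-skipping and component-ordering techniques above all aim to avoid the full per-vertex sweep that this baseline performs in every iteration.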
2. What is it?
● Grew out of work in Artificial Intelligence (AI)
● 1959, Arthur Samuel – Machine Learning:
● „Field of study that gives computers the ability to learn without being explicitly programmed.”
● 1998, Tom Mitchell – Well-posed learning problem:
● „A computer program is said to 'learn' from experience 'E' with respect to some task 'T' and some performance measure 'P', if its performance on 'T', as measured by 'P', improves with experience 'E'.”
3. What is it?
● Example: Email program
– 'E' (experience) – watches you label emails as spam/not spam
– 'T' (task) – classifies emails as spam/not spam
– 'P' (performance) – fraction of emails correctly classified as spam/not spam
4. What is it?
● Solves complicated, underspecified problems
● Some problems can't be solved directly by software
● Instead of writing a program for each problem:
● Collect samples of correct input->output
● Use algorithm to create a program to do the same
● Program handles new cases (other than those in
the training data), retrain if new data
● Massive amounts of data + computation is
cheaper than developing software
http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html
5. Problems for Machine Learning
● Pattern recognition
● Objects in real scenes
● Computer vision – facial identities / expressions
● Speech recognition
– Sample sounds
– Partition phonemes
– Decoding – extract meaning, NLP
● Natural language
6. Problems for Machine Learning
● Recognizing anomalies
● Unusual sequences
– Credit / phone fraud
– SPAM / HAM
● Sensor readings
– Power plant operation and health
– Detect when actions are required
7. Problems for Machine Learning
● Prediction
● Stock price movements (time sequence)
● Currency exchange rates
● Risk analytics
● Sentiment analysis
● Click throughs (web traffic)
● Preferences
– Netflix, Amazon, Pandora, web ad targeting, etc.
8. Problems for Machine Learning
● Information Retrieval (database mining)
● Genomics
● News/Twitter data feeds
● Archived data
● Web clicks
● Medical records
● Find similar, summarize groups of material
9. Learning - Supervised
● Predict output given the input, train using inputs
with known outputs
● Regression – target is a real number, goal is to
be 'close'
● Classification – target is a class label: binary
(yes/no) or multi-class (one of many)
10. Learning – Unsupervised
● Older texts explicitly exclude this from being
learning!
● Discover good internal representation of input
● Difficult to determine what the goal is
● Create a representation that can be used in
subsequent supervised learning?
● Dimensionality reduction (PCA) can be used for
compression or to simplify analysis
● Provide an economical high dimensional
representation (binary features, real features –
single largest parameter)
11. Learning – Reinforcement
● Select action to maximize payoff
● Maximize expected sum of future rewards
● Not every action results in a payoff
● Apply discounting to minimize effect of far future on
present decisions
● Difficult – payoffs are delayed, critical decision
points unknown, scalar payoff contains little
information
12. Learning – Reinforcement
● Planning
● Choice of actions by anticipating outcomes
● Actions and planning can be interleaved
(incomplete knowledge)
– Warehouse, dock management, Route
planning/replanning
● Multiple simultaneous agents planning
independently
– Emergency responders
– http://www.aiai.ed.ac.uk/project/i-globe/resources/2007-03-06-Iglobe/2007-03-06-Iglobe-Demo.avi
13. Learning – Data
● Training data [ ~60% - 80% ]
● Inputs (with correct response for supervised)
● Validation data [ ~20% ]
● Converge by training on multiple sets of data,
improving each time
● Test data [ ~10% - 20% ]
● Not used until training and validation are complete –
measure performance with this data set
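The 60–80% / ~20% / 10–20% partition above can be sketched as a shuffle followed by two cuts; the function name, ratios, and seed are illustrative:

```python
import random

def partition(data, train=0.7, valid=0.2, seed=42):
    """Shuffle, then split into training / validation / test sets."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    i = int(n * train)
    j = i + int(n * valid)
    return rows[:i], rows[i:j], rows[j:]

train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))  # → 70 20 10
```

The fixed seed makes the split reproducible; the test portion is everything left after the training and validation cuts.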
14. Learning – Data
● Partition randomly
● Time series data use random subsequences
● Training and test data should be from same
population
● If feature selection or model tuning required
(e.g. PCA parameter mapping) then the tuning
must be done for each training set
15. Learning – Training
● One iteration for each set of input data in the
training data set
● Start with random parameters
● Randomize input data during training
● Calculate model parameters for each input
● Use previous parameter values to calculate
next values using new training input
16. Learning – Bias and Variance
● Bias – algorithm errors
● High bias – underfit
● More training data does not help
● Variance – sensitivity to fluctuations in data
● High variance – overfit
● More training data likely to help
● Irreducible error - noise
18. Learning – (Cross) Validation
● Validation
● Holdout data for tuning model with new data
● Evaluate model using holdout as test set
● Cross validation
● generating models with different holdouts to avoid
overfitting
● n-fold - divide data into n chunks and train n times,
treating a different chunk as the holdout each time
(leave-one-out – same with chunk size of 1)
● Random subsampling – approaches leave-p-out
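The n-fold scheme above (divide into n chunks, hold a different chunk out each time) can be sketched as follows; the function name is illustrative:

```python
def kfold_indices(n, k):
    """Yield (train, holdout) index lists for k-fold cross validation:
    each chunk serves as the holdout exactly once."""
    idx = list(range(n))
    size = n // k
    for f in range(k):
        # Last fold absorbs any remainder so every index is used.
        hold = idx[f * size:(f + 1) * size] if f < k - 1 else idx[f * size:]
        held = set(hold)
        train = [i for i in idx if i not in held]
        yield train, hold

for train, hold in kfold_indices(10, 5):
    print(hold)  # each pair of indices is held out exactly once
```

Setting the chunk size to 1 (k = n) gives the leave-one-out variant mentioned above.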
19. Learning - Improvements
● Things to do when the error is too high
● Get more training data (high variance)
● Try smaller sets of features (high variance)
● Try getting additional features (high bias)
● Add polynomial features (high bias)
● Decrease smoothing parameter λ (high bias)
● Increase smoothing parameter λ (high variance)
20. Learning – Testing
● Reserve set of data [~10% - 20% ]
● Evaluate model performance with the test set
● Make no further model changes
● Performance evaluation
● Supervised learning – compare predictions with
known results
● Predictions of unsupervised model when results
can be known – even if not used in training
22. Training – Gradient Descent
● Linear cost function
● Well behaved
● Single global minimum, easily reached
23. Training – Gradient Descent
● Complex cost functions
● Not well behaved
● Global minimum, many local minima
24. Training – Gradient Descent
● Convergence speed and stability controlled by
slope parameter α
● Low α: slow but stable convergence ● High α: fast but may overshoot or diverge
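The effect of the slope parameter α can be seen in a minimal 1-D sketch (the function and values are chosen for illustration only):

```python
def gradient_descent(grad, x0, alpha, steps=100):
    """Minimise a 1-D function from its gradient; alpha is the step size."""
    x = x0
    for _ in range(steps):
        x -= alpha * grad(x)
    return x

# f(x) = (x - 3)^2, gradient 2(x - 3), minimum at x = 3.
grad = lambda x: 2 * (x - 3)
print(gradient_descent(grad, x0=0.0, alpha=0.1))  # converges to ~3
print(gradient_descent(grad, x0=0.0, alpha=1.1))  # step too large: diverges
```

With a small α each step shrinks the distance to the minimum; once α is large enough that a step overshoots by more than the current distance, the iterates oscillate with growing amplitude.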
25. Training – k-means
● Classify data into k different groups
● Start with k random points
● Group data with the closest point
● Move the points to the centroid of the data for that
point
● Terminate when the points no longer move (or
move only a small amount)
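The assign-then-recentre loop above can be sketched on 1-D data (the data and starting centres are illustrative):

```python
def kmeans(points, centers, iters=20):
    """Plain k-means on 1-D data: assign each point to the nearest
    centre, then move each centre to the centroid of its group."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            groups[nearest].append(p)
        new = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if new == centers:  # centres no longer move: terminate
            break
        centers = new
    return centers

data = [1.0, 1.5, 0.5, 9.0, 9.5, 8.5]
print(kmeans(data, centers=[0.0, 10.0]))  # → [1.0, 9.0]
```

In practice the starting points are chosen at random (as the slide says), and an empty group keeps its old centre.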
27. Training – k-nn
● k nearest neighbors determine classification of
each element in data
● Skewed data can result in homogenous result
● Use weighting to avoid this
● Training – store the training data
● For each data point to be predicted
● Locate the nearest k other points
– Use any consistent distance metric – l-p norms (euclidean, manhattan distances, maximum single direction)
● Assign the majority class of those nearest points
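The locate-then-vote procedure above can be sketched on 1-D points (the training data and labels are made up for illustration):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs. Classify query by majority
    vote of its k nearest neighbours (absolute distance on 1-D points)."""
    nearest = sorted(train, key=lambda t: abs(t[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [(1.0, "cheap"), (1.5, "cheap"), (2.0, "cheap"),
         (8.0, "costly"), (9.0, "costly")]
print(knn_predict(train, query=1.8))  # → cheap
```

"Training" here really is just storing the data, as the slide notes; distance-weighted votes (not shown) are one way to handle the skewed-data problem mentioned above.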
30. Regression
● Single / Multiple variable
● Linear / Logistic
● Regularization (smoothing) – helps to avoid
overfitting
31. Regression – Equations
● Linear regression
hypothesis function
● Logistic regression
hypothesis function
● Regularized linear
regression cost
function
● Regularized logistic
regression cost
function
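The formulas behind these four names did not survive extraction; in the usual notation they are h(x) = θᵀx (linear), h(x) = 1/(1 + e^(−θᵀx)) (logistic), with the regularised costs adding a (λ/2m)·Σθⱼ² penalty to the fit term. A minimal sketch under those assumptions (names illustrative):

```python
import math

def h_linear(theta, x):
    """Linear-regression hypothesis: h(x) = theta . x
    (x carries a leading 1 so theta[0] is the intercept)."""
    return sum(t * xi for t, xi in zip(theta, x))

def h_logistic(theta, x):
    """Logistic-regression hypothesis: sigmoid of the linear hypothesis."""
    return 1.0 / (1.0 + math.exp(-h_linear(theta, x)))

def cost_linear_reg(theta, X, y, lam):
    """Regularised linear-regression cost:
    J = (1/2m)*sum((h - y)^2) + (lam/2m)*sum(theta_j^2) for j >= 1
    (the intercept theta[0] is conventionally not penalised)."""
    m = len(X)
    fit = sum((h_linear(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)
    penalty = lam * sum(t * t for t in theta[1:]) / (2 * m)
    return fit + penalty
```

The regularised logistic cost has the same λ penalty attached to the log-loss instead of the squared error.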
32. Neural Networks - Representation
● Nodes – compared to neurons, many inputs,
one output
● Transfer characteristic – logistic function
● Input from left, output to right
● Layers
– Input layer, driven by numeric input values
– Output layer, provides numeric output values (or
thresholded for classification output)
– Hidden layers between input and output – no discernable
meaning for their values
34. Neural Networks – Learning
● Learns using gradient descent
● Forward propagation – start at inputs, derive
parameters of next stage
● Backward propagation – start at outputs, adjust
parameters to produce desired output
35. Neural Networks - Learning
● OCR training set
● what does the number '2' look like when
handwritten?
36. Neural Networks - Learning
● Neural Network parameters are not simply
interpretable
37. Support Vector Machines
● Supervised learning classification and
regression algorithm
● Cocktail Party Problem
● Many speakers, many sensors (microphones)
● Classify source from the inputs
[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
38. Principal Component Analysis
● Unsupervised learning
● Finds basis vectors for data
● Largest is the 'principal' component
● Center each attribute on mean for visualization,
not for prediction models
● Normalized to same range to provide
comparable contributions from each factor
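The centring and range-normalisation step described above can be sketched column-wise (the function name and data are illustrative):

```python
def normalise(columns):
    """Centre each attribute on its mean and scale by its range so
    every factor contributes comparably before PCA."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        span = max(col) - min(col) or 1.0  # guard against constant columns
        out.append([(v - mean) / span for v in col])
    return out

cols = normalise([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
print(cols)  # both columns become [-0.5, 0.0, 0.5]
```

After this step the two attributes, despite differing by a factor of ten originally, contribute equally to the basis vectors PCA finds.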
47. Classification - Performance
● Receiver Operating Characteristic (ROC)
● Location of classification performance
● Perfect predictions indicated in upper left corner
● Up and to the left means better
● Diagonal from lower left to upper right indicates
performance equivalent to random guessing
49. Classification - Performance
● Area Under the Curve (AUC)
● ROC chart with curves applied
● Classifications based on thresholds for continuous
random variables
● Curve is parametric plot with the threshold as the
varying parameter
● AUC is a scalar summary of predictive value
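Given the parametric (FPR, TPR) points traced out by sweeping the threshold, the AUC summary can be sketched with the trapezoid rule (function name illustrative):

```python
def auc(roc_points):
    """Area under an ROC curve given (fpr, tpr) points, by the trapezoid rule."""
    pts = sorted(roc_points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Random guessing: the diagonal gives AUC = 0.5.
print(auc([(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]))  # → 0.5
```

A perfect classifier's curve hugs the upper-left corner and yields AUC = 1.0, matching the "up and to the left is better" reading of the ROC chart above.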
51. Natural Language Processing
● Text processing
● Modeling
● Generative models – generate observed data from
hidden parameters
– N-gram, Naive Bayes, HSMM, CFG
● Discriminative models – estimate probability of
hidden parameters from observed data
– Regressions, maximum entropy, conditional random
fields, support vector machines, neural networks
52. NLP - Language Modeling
● Probability of sequences of words (fragments,
sentences)
● Markov assumption
● Product of each element probability conditional on
small preceding sequence
– N-grams – bigrams: single preceding word; trigrams: two preceding words
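Under the Markov assumption above, a bigram model scores a sequence as the product of each word's probability given only the preceding word, estimated from counts. A minimal sketch (the counts and sentence are made up for illustration):

```python
def bigram_prob(sentence, unigram_counts, bigram_counts):
    """Bigram (Markov) probability of a word sequence:
    P(w1..wn) ~= product of count(w_{i-1}, w_i) / count(w_{i-1})."""
    p = 1.0
    words = sentence.split()
    for prev, cur in zip(words, words[1:]):
        p *= bigram_counts.get((prev, cur), 0) / unigram_counts.get(prev, 1)
    return p

unigram_counts = {"the": 2, "house": 1}
bigram_counts = {("the", "house"): 1, ("house", "sold"): 1}
print(bigram_prob("the house sold", unigram_counts, bigram_counts))  # → 0.5
```

Real models add smoothing so that unseen bigrams do not zero out the whole product; this sketch omits it for clarity.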
53. NLP - Information Extraction
● Find and understand relevant parts of texts
● Gather information from many sources
● Produce structured representation
● Relations, knowledge base
● Resource Description Framework (RDF)
● Retrieval
● Finding unstructured material in a large collection
● Web/email search, knowledge bases, legal data,
health data, etc.