My slides from the Minds Mastering Machines conference 2018 in Cologne about Deep Learning and Mathematical Optimization: the methods used for training neural nets, and how they perform with respect to training and, especially, learning, i.e. how well the trained predictors generalize.
Talk at the Data Science Meetup Hamburg about Deep Learning, the most important Optimization methods in this field and the relationship between training and learning
A HYBRID CLUSTERING ALGORITHM FOR DATA MINING (cscpconf)
Data clustering is the process of arranging similar data into groups. A clustering algorithm partitions a data set into several groups such that the similarity within a group is higher than between groups. In this paper a hybrid clustering algorithm based on K-means and K-harmonic means (KHM) is described. The proposed algorithm is tested on five different datasets. The research is focused on fast and accurate clustering. Its performance is compared with the traditional K-means and KHM algorithms; the results obtained from the proposed hybrid algorithm are much better than those of the traditional K-means and KHM algorithms.
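For context, here is a minimal sketch of the baseline k-means iteration that both methods build on (our own illustration, not the paper's hybrid K-means/KHM algorithm; the toy dataset is made up):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: assign points to the nearest centroid, recompute means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster went empty
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centroids = kmeans(X, k=2)
print(labels)  # the two tight pairs end up in separate clusters
```

KHM replaces the hard nearest-centroid assignment with harmonic-mean weighting, which makes the result less sensitive to the random initialization that plain k-means suffers from.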
Timo Klerx and Kalman Graffi. Bootstrapping Skynet: Calibration and Autonomic Self-Control of Structured Peer-to-Peer Networks. In IEEE P2P ’13: Proceedings of the International Conference on Peer-to-Peer Computing, 2013.
Abstract—Peer-to-peer systems scale to millions of nodes and provide routing and storage functions with best effort quality. In order to provide a guaranteed quality of the overlay functions, even under strong dynamics in the network with regard to peer capacities, online participation and usage patterns, we propose to calibrate the peer-to-peer overlay and to autonomously learn which qualities can be reached. For that, we simulate the peer- to-peer overlay systematically under a wide range of parameter configurations and use neural networks to learn the effects of the configurations on the quality metrics. Thus, by choosing a specific quality setting by the overlay operator, the network can tune itself to the learned parameter configurations that lead to the desired quality. Evaluation shows that the presented self-calibration succeeds in learning the configuration-quality interdependencies and that peer-to-peer systems can learn and adapt their behavior according to desired quality goals.
The KNN algorithm is one of the simplest classification algorithms and one of the most used learning algorithms. KNN is a non-parametric, lazy learning algorithm. Its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point.
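The description above can be sketched in a few lines (a minimal illustration; the dataset and class labels are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Lazy learning: no training phase, just look at the k closest points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    # majority vote among the k nearest training labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array(["a", "a", "b", "b"])
print(knn_predict(X_train, y_train, np.array([0.1, 0.1])))  # → a
```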
New Approach for K-mean and K-medoids Algorithm (Editor IJCATR)
K-means and K-medoids clustering algorithms are widely used for many practical applications. The original k-means and k-medoids algorithms select initial centroids and medoids randomly, which affects the quality of the resulting clusters and sometimes generates unstable and empty clusters that are meaningless. They are also computationally expensive, requiring time proportional to the product of the number of data items, the number of clusters, and the number of iterations. The new approach for the k-means algorithm eliminates this deficiency of the existing k-means: it first calculates the initial centroids according to the requirements of the users and then produces better, more effective and stable clusters. It also takes less execution time because it eliminates unnecessary distance computations by reusing results from the previous iteration. The new approach for k-medoids selects initial medoids systematically based on initial centroids; it generates stable clusters to improve accuracy.
An Introduction to Reinforcement Learning - The Doors to AGI (Anirban Santara)
Reinforcement Learning (RL) is a genre of Machine Learning in which an agent learns to choose optimal actions in different states in order to reach its specified goal, solely by interacting with the environment through trial and error. Unlike supervised learning, the agent does not get examples of "correct" actions in given states as ground truth. Instead, it has to use feedback from the environment (which can be sparse and delayed) to improve its policy over time. The formulation of the RL problem closely resembles the way in which human beings learn to act in different situations. Hence it is often considered the gateway to achieving the goal of Artificial General Intelligence.
The motivation of this talk is to introduce the audience to key theoretical concepts like formulation of the RL problem using Markov Decision Process (MDP) and solution of MDP using dynamic programming and policy gradient based algorithms. State-of-the-art deep reinforcement learning algorithms will also be covered. A case study of the application of reinforcement learning in robotics will also be presented.
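The dynamic-programming solution of an MDP mentioned above can be illustrated with value iteration on a toy two-state MDP (our own example, not from the talk; the transition table and rewards are arbitrary choices):

```python
# Value iteration on a tiny MDP: states 0 and 1, actions 0 and 1.
# P[s][a] lists (probability, next_state); R[s][a] is the immediate reward.
P = {0: {0: [(1.0, 0)], 1: [(1.0, 1)]},   # action 1 moves state 0 -> 1
     1: {0: [(1.0, 1)], 1: [(1.0, 1)]}}   # state 1 is absorbing
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 2.0}}
gamma = 0.9  # discount factor

V = {0: 0.0, 1: 0.0}
for _ in range(200):  # Bellman backups until (approximate) convergence
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in (0, 1))
         for s in (0, 1)}
print(round(V[1], 2))  # state 1 yields reward 2 forever: 2 / (1 - 0.9) = 20
```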
Detecting facial keypoints is a very challenging problem. Facial features vary greatly from one individual to another, and even for a single individual there is a large amount of variation due to 3D pose, size, position, viewing angle, and illumination conditions. Computer vision research has come a long way in addressing these difficulties, but there remain many opportunities for improvement.
In this presentation we have used different methods to recognize facial keypoints and compared their RMSE (Root Mean Square Error) values to obtain better results and accuracy.
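RMSE, the metric used above to compare the detectors, is simply the root of the mean squared difference between predicted and true keypoint coordinates (a generic sketch, not the presentation's code):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between true and predicted values."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # sqrt(4/3) ≈ 1.155
```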
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method proposed by Thomas Cover used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space.
K Nearest Neighbor V1.0 Supervised Machine Learning Algorithm (DataMites)
Are you planning to learn machine learning algorithms?
Go through the slides for K Nearest Neighbor V1.0 Supervised Machine Learning Algorithm information.
DataMites provides a data science course covering machine learning algorithms. Join classroom or ONLINE training and get certified as a data scientist at the end of the course.
For more details visit: https://datamites.com/data-science-course-training-bangalore/
This is a very simple introduction to clustering with some real-world examples. At the end of the lecture I use the Stack Overflow API to test some clustering. I also wanted to try Facebook, but there were some problems with its API.
This presentation discusses the following topics:
Introduction
Need for Problem formulation
Problem Solving Components
Definition of Problem
Problem Limitation
Goal or Solution
Solution Space
Operators
Examples of Problem Formulation
Well-defined Problems and Solution
Examples of Well-Defined Problems
Constraint satisfaction problems (CSPs)
Examples of constraint satisfaction problem
Decision problem
Chap 8. Optimization for training deep models (Young-Geun Choi)
Internal lab seminar slides: a summary of and excerpts from Chapter 8 of Goodfellow et al. (2016), Deep Learning, MIT Press, introducing the optimization methods commonly used for training deep neural network models.
Machine learning in science and industry — day 1 (arogozhnikov)
A course of machine learning in science and industry.
- notions and applications
- nearest neighbours: search and machine learning algorithms
- roc curve
- optimal classification and regression
- density estimation
- Gaussian mixtures and EM algorithm
- clustering, an example of clustering in the opera
In the world of recommendation systems, there are various theories and algorithms that work together to give the best results. Among these, the core recommendation algorithm is crucial. This paper will provide an introduction to some fundamental algorithms used in recommendation systems. These algorithms are like building blocks that help make recommendations more effective.
Gradient-based Meta-learning with learned layerwise subspace and metric (NAVER Engineering)
Presenter: Yoonho Lee (master's student, POSTECH)
Date: February 2018
Typical deep learning models can only be trained when plenty of data is available.
To learn from little data, we need meta-learning: learning about the learning process itself.
Our proposed MT-net learns a subspace in which to perform gradient descent, together with a distance metric within that subspace, so that it can learn efficiently even from little data.
Naver learning to rank question answer pairs using hrde-ltc (NAVER Engineering)
The automatic question answering (QA) task has long been considered a primary objective of artificial intelligence.
Among the QA sub-systems, we focused on answer-ranking part. In particular, we investigated a novel neural network architecture with additional data clustering module to improve the performance in ranking answer candidates which are longer than a single sentence. This work can be used not only for the QA ranking task, but also to evaluate the relevance of next utterance with given dialogue generated from the dialogue model.
In this talk, I'll present our research results (NAACL 2018) and their potential use cases (e.g. fake news detection). Finally, I'll conclude by discussing some issues with previous research and introducing recent approaches in academia.
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley (Charles Martin)
Talk given on Dec 13, 2018 at ICSI, UC Berkeley
http://www.icsi.berkeley.edu/icsi/events/2018/12/regularization-neural-networks
Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models and smaller models trained from scratch. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of self-regularization, implicitly sculpting a more regularized energy or penalty landscape. In particular, the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of explicit regularization. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, and applying them to these empirical results, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of implicit self-regularization. For smaller and/or older DNNs, this implicit self-regularization is like traditional Tikhonov regularization, in that there appears to be a ``size scale'' separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of heavy-tailed self-regularization, similar to the self-organization seen in the statistical physics of disordered systems. Moreover, we can use these heavy tailed results to form a VC-like average case complexity metric that resembles the product norm used in analyzing toy NNs, and we can use this to predict the test accuracy of pretrained DNNs without peeking at the test data.
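The empirical spectral density (ESD) analyzed in the abstract can be sketched as follows. Here the weight matrix is random Gaussian (an assumption for illustration only), so its ESD should roughly follow the Marchenko-Pastur bulk of RMT, rather than the heavy-tailed shapes the talk identifies in trained state-of-the-art DNNs:

```python
import numpy as np

# ESD sketch: eigenvalue spectrum of the layer correlation matrix X = W^T W,
# for an (untrained) Gaussian weight matrix with entries of variance 1/N.
rng = np.random.default_rng(0)
N, M = 1000, 300
W = rng.normal(scale=1.0 / np.sqrt(N), size=(N, M))
X = W.T @ W                       # M x M correlation matrix
eigs = np.linalg.eigvalsh(X)      # its eigenvalue spectrum
hist, edges = np.histogram(eigs, bins=30, density=True)  # the ESD
print(eigs.min(), eigs.max())     # bulk edges near (1 ± sqrt(M/N))^2
```

For a trained layer one would use its actual weight matrix instead; deviations of the measured ESD from this random-matrix bulk are exactly the self-regularization signatures the talk describes.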
This talk was presented in Startup Master Class 2017 - http://aaiitkblr.org/smc/ 2017 @ Christ College Bangalore. Hosted by IIT Kanpur Alumni Association and co-presented by IIT KGP Alumni Association, IITACB, PanIIT, IIMA and IIMB alumni.
My co-presenter was Biswa Gourav Singh. And contributor was Navin Manaswi.
http://dataconomy.com/2017/04/history-neural-networks/ - timeline for neural networks
Transfer learning with LTANN-MEM & NSA for solving multi-objective symbolic r... (Amr Kamel Deklel)
Abstract
Long Term Artificial Neural Network Memory (LTANN-MEM) and the Neural Symbolization Algorithm (NSA) are proposed for solving symbolic regression problems. Although this approach is capable of solving Boolean decoder problems of sizes 6, 11 and 20, it cannot solve decoder problems of higher dimensions like decoder-37; decoder-n is a decoder whose inputs and outputs sum to n, e.g. decoder-20 is a decoder with 4 inputs and 16 outputs. It is shown here that the LTANN-MEM and NSA approach is a kind of transfer learning, but that it lacks sub-tasking transfer and an updatable LTANN-MEM. An approach for adding sub-tasking transfer and LTANN-MEM updates is discussed and examined by solving decoder problems of sizes 37, 70 and 135 efficiently. Comparisons with two learning classifier systems show that the proposed approach outperforms both of them. The proposed approach is also used to solve decoder-264 efficiently; to the best of our knowledge, there is no previously reported approach for solving this high-dimensional problem.
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS (csandit)
The ability to automatically mine and extract useful information from large datasets has been a common concern for organizations (having large datasets) over the last few decades. On the internet, data is increasing rapidly, and consequently the capacity to collect and store very large data is significantly increasing. Existing clustering algorithms are not always efficient and accurate in solving clustering problems for large datasets, and the development of accurate and fast data classification algorithms for very large scale datasets is still a challenge. In this paper, various algorithms and techniques, especially an approach using a non-smooth optimization formulation of the clustering problem, are proposed for solving the minimum sum-of-squares clustering problem in very large datasets. This research also develops an accurate and real-time L2-DC algorithm based on the incremental approach to solve the minimum sum-of-squares clustering problem.
These are the slides from my master's defense (17 April 2003).
Subject: "High capacity neural network optimization problems: study & solutions exploration"
Facebook Talk at Netflix ML Platform meetup Sep 2019 (Faisal Siddiqi)
In this talk at the Netflix Machine Learning Platform Meetup on 12 Sep 2019, Sam Daulton from Facebook discusses "Practical Solutions to real-world exploration problems".
Everyone is talking about Data Mesh architectures already - assuming that there is already a full-fledged self-service data platform in place. A reality check reveals that most (data) platforms are not really working that well and fail to deliver value at scale. And in contrast to the business notion of a platform, where network effects make a platform more valuable the more users and products it has, this does not hold true for data platforms in particular (at least I haven't seen a proof so far).
So where to start, when data-transforming an organization? One approach, inspired by the Lean framework, is outlined in this talk. It all starts with what is actually working - identify some (data) products that drive value already. These are the ones you can build a platform for. It's a myth that you just need to build a solid platform, and then everyone will come and build amazing data products. They will never come. But starting with what already works is a reasonable first step. Step two is about creating flow, supporting the value stream end-to-end. Co-creation is your main tool here, fostering collaboration and ownership. Then you can think of platformizing what is really, really needed, avoiding the "waste" that modern data systems / platforms / architectures tend to pile up. In the end, the "right" architecture for your organization will emerge, you cannot simply copy-paste "solutions" that are not addressing your specific challenges.
Long story short, there is a path to success, but it's not easy, it's not copying others, it's finding your own way. And as in all good strategies, you can specify the "qualities" you'd like to see in the end. And the concrete solutions need to emerge from the hard work of the motivated people, that are already driving value for your organization now.
Data teams are contributing to a variety of value streams, as they are delivering value to a variety of stakeholders. The value streams are often not well-supported and the involved teams are facing constant challenges like Data Quality and Data Ownership. Also, data products often rely on the same data points for building the product and for measuring its success - so a lack of data quality leads to poor product quality and weak measurability at the same time. These challenges become exponentially harder, the larger the organization has grown. We propose a way of conceptualizing and visualizing the process of building data products, using the concept of the data value chain. Applying the Five Principles of Lean, especially Defining Value and Mapping out Data Value Streams, to the way build data products and operate data systems at scale, we create a framework that allows to focus on value delivery, avoids "waste" and supports ownership.
Talk at MCubed London about Manifold Learning and Applications (Stefan Kühn)
How to make use of Manifold Learning methods for Dimensionality Reduction, Data Visualization and Automated Feature Engineering, this time also with UMAP. Most of the cool stuff is in the Jupyter notebooks.
Talk at PyData Berlin about Manifold Learning and Applications (Stefan Kühn)
These are the slides from my talk at PyData Berlin about how to use Manifold Learning in the context of Data Visualization and Feature Engineering. There are several Jupyter notebooks exploring this; you can find them on GitHub at https://github.com/cc-skuehn/Manifold_Learning
Manifold Learning and Data Visualization (Stefan Kühn)
Talk at PyData Hamburg 2018-03-01 about Manifold Learning and Data Visualization with Python and Scikit-learn, plus Random Projections and PCA; includes links to all resources and the GitHub repository with worked examples in the form of Jupyter notebooks (we recommend using JupyterLab).
Visualizing and Communicating High-dimensional Data (Stefan Kühn)
Slides from my talk at Data Natives, starting with the different Modes of Perception, the components of Visualization and Graphics and how to transport Information efficiently, then giving examples of how modern approximation techniques - manifold learning, principal curves - and visualization techniques - pair plots, correlation plots, parallel coordinates, grand tour - can be used in order to approach complex multi-dimensional data.
Data quality - The True Big Data Challenge (Stefan Kühn)
Data Quality is one of the most overlooked key aspects of any Big Data project or approach. This talk addresses the problem from various perspectives, discusses the main challenges and identifies possible solutions.
In this talk we discuss the connections between (Supervised) Learning and Mathematical Optimization. Topics include iterative algorithms, search directions and stepsizes. The talk was held at the Computer Science, Machine Learning and Statistics Meetup Hamburg.
Learn SQL from basic queries to Advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Adjusting OpenMP PageRank : SHORT REPORT / NOTES (Subhajit Sahu)
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf (GetInData)
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Compressed Sparse Row (CSR) is an adjacency-list based graph representation used by graph algorithms like PageRank.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
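One of the techniques listed, skipping computation on vertices that have already converged, can be sketched as follows (a simplified pull-style PageRank, not the STICD implementation; `frozen_tol` and the small example graph are our own choices):

```python
import numpy as np

def pagerank_skip(adj, d=0.85, frozen_tol=1e-8, iters=100):
    """PageRank where a vertex is dropped from the work set once its rank
    stops changing, saving per-iteration work on converged vertices."""
    n = len(adj)
    out_deg = [len(nbrs) for nbrs in adj]
    # in-neighbours of each vertex, for pull-style updates
    in_nbrs = [[u for u in range(n) if v in adj[u]] for v in range(n)]
    r = np.full(n, 1.0 / n)
    active = set(range(n))
    for _ in range(iters):
        if not active:
            break  # every vertex has converged
        for v in list(active):
            new = (1 - d) / n + d * sum(r[u] / out_deg[u] for u in in_nbrs[v])
            if abs(new - r[v]) < frozen_tol:
                active.discard(v)  # converged: skip in later iterations
            r[v] = new
    return r

adj = [[1], [2], [0]]      # a 3-cycle: all ranks must be equal
print(pagerank_skip(adj))  # ≈ [1/3, 1/3, 1/3]
```

This sketch assumes no dangling vertices; the chain short-circuiting, in-identical vertex merging, and component-ordered scheduling described above would layer on top of this basic loop.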
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
The Machinery behind Deep Learning
1. The Machinery behind Deep Learning
Stefan Kühn
Join me on XING
Minds Mastering Machines - Cologne - April 26th, 2018
Stefan Kühn (XING) Deep Optimization 26.04.2018 1 / 35
2. Contents
1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
3. 1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
4. Deep Learning
Neural Networks - Universal Approximation Theorem
A 1-hidden-layer feed-forward neural net with a finite number of neurons can
approximate any continuous function on compact subsets of R^n
Questions:
Why do we need deep learning at all?
theoretical result, requires wide nets
approximation by piecewise constant functions (not what you might
want for classification/regression)
deep nets can replicate the capacity of wide shallow nets with
performance and stability improvements
Why are deep nets harder to train than shallow nets?
More parameters to be learned by training?
More hyperparameters to be set before training?
Numerical issues?
disclaimer — ideas stolen from Martens, Sutskever, Bengio et al. and many more —
5. Example: RNNs
Recurrent Neural Nets
Extremely powerful for modeling sequential data, e.g. time series but
extremely hard to train (somewhat less hard for LSTMs/GRUs)
Main Advantages:
Qualitatively: Flexible and rich model class
Practically: Gradients easily computed by Backpropagation (BPTT)
Main Problems:
Qualitatively: Learning long-term dependencies
Practically: Gradient-based methods struggle when separation between
input and target output is large
6. Example: RNNs
Recurrent Neural Nets
Highly volatile relationship between parameters and hidden states
Indicators
Vanishing/exploding gradients
Internal Covariate Shift
Remedies
ReLU
’Careful’ initialization
Small stepsizes
(Recurrent) Batch Normalization
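The vanishing-gradient indicator above can be made concrete with a toy numpy sketch (not a real network: random weights and random stand-in pre-activations, widths and depth chosen only for illustration). It pushes a gradient backwards through a stack of layers and compares sigmoid against ReLU with 'careful' (He-style) initialization:

```python
import numpy as np

rng = np.random.default_rng(1)

def backprop_norm(depth, activation, width=100):
    """Norm of a gradient pushed backwards through `depth` random layers."""
    g = np.ones(width)
    for _ in range(depth):
        if activation == "relu":
            # He-style init, a 'careful' initialization for ReLU
            W = rng.normal(scale=np.sqrt(2.0 / width), size=(width, width))
        else:
            # Xavier-style init for sigmoid
            W = rng.normal(scale=np.sqrt(1.0 / width), size=(width, width))
        h = rng.normal(size=width)              # stand-in pre-activations
        if activation == "relu":
            deriv = (h > 0).astype(float)       # ReLU'(h) is 0 or 1
        else:
            s = 1.0 / (1.0 + np.exp(-h))
            deriv = s * (1.0 - s)               # sigmoid'(h) <= 0.25 everywhere
        g = W.T @ (deriv * g)                   # one backprop step
    return np.linalg.norm(g)

sig = backprop_norm(30, "sigmoid")
rel = backprop_norm(30, "relu")
```

With 30 layers the sigmoid gradient norm collapses towards zero, while the ReLU gradient with matched initialization stays on the order of one, which is exactly the point of the ReLU and 'careful' initialization remedies.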
7. Example: RNNs
Recurrent Neural Nets and LSTM
Hochreiter and Schmidhuber proposed a change to the RNN architecture by
adding Long Short-Term Memory units
Vanishing/exploding gradients?
fixed linear dynamics, no longer problematic
Any questions open?
Gradient-based training works better with LSTMs
LSTMs can compensate for one deficiency of gradient-based learning, but
is this the only one?
Most problems are related to specific numerical issues.
8. 1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
9. Notions of Optimality
Mathematical Optimization
Minimize a given loss function by a certain optimization method or strategy
until convergence.
Vidal et al, Mathematics of Deep Learning
10. Notions of Optimality
Mathematical Optimization
Minimize a given loss function by a certain optimization method or strategy
until convergence.
Local Optimum: Minimum in local neighborhood (global minimum
might not even exist)
Global Optimum: Point with lowest function value (if existing)
Critical points: Candidates for local/global optima, or saddle points
Iterative Minimization: Step-by-step approach to find minima
Descent direction: Direction in which the function value decreases, at
least for small steps
Gradient: For differentiable functions the negative gradient is always a
descent direction, and it vanishes at critical points
11. Optimality and Deep Neural Nets
Some surprisingly strong theoretical results for this nonlinear+nonconvex
optimization problem - and practical evidence as well!
Saddle points: In high-dimensional non-convex problems most critical
points are saddle points, not poor local minima -> bad local minima are
rarely observed for Deep Nets
Local and global optima: Deep Nets seem to have the property that
local optima are located near the global optimum
Optimal representation: Deep Nets can represent data optimally under
certain conditions (minimal sufficient statistic)
Information Theory: Deep Nets and entropy are becoming best friends,
with strong relations to optimal control theory (optimization in infinite
dimensions)
Global optimality for positively homogeneous networks:
self-explanatory
12. Notions of Error
Decomposition of the Error
Even the best possible prediction - the optimal prediction via the so-called
Bayes predictor - comes with an error.
Error Components
Bayes Error: Theoretically optimal error
Approximation Error: Error introduced by the model class
Estimation Error: Error introduced by parameter estimation / model
training / optimization method
13. Notions of Error
Example
Bayes Error: Even the optimal predictor for house prices using only zip
codes makes an error -> Property of the data / features
Approximation Error: Linear models cannot resolve non-linear
relationships between the features irrespective of the training method
(but possibly could with different features, e.g. polynomial
regression)
Estimation Error: Did we select the right model from the model class
based on the available data? -> depends on model class, data and
training / optimization method
But what about the Generalization Error?
14. Notions of Learning
Learning
A core objective of a learner is to generalize from its experience.
But why do we use Mathematical Optimization for Learning?
What would be an alternative? Biology?
15. Trade-offs between Optimization and Learning
Computational complexity becomes the limiting factor when one envisions
large amounts of training data. [Bottou, Bousquet]
Underlying Idea
Approximate optimization algorithms might be sufficient for learning
purposes. [Bottou, Bousquet]
Implications:
Small-scale: Trade-off between approximation error and estimation
error
Large-scale: Computational complexity dominates
Long story short:
The best optimization methods might not be the best learning
methods!
16. Empirical results
Empirical evidence for SGD being a better learner than optimizer.
RCV1, text classification, see e.g. Bottou, Stochastic Gradient Descent Tricks
17. 1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
18. Advanced Concepts in Mathematical Optimization
Stepsize rules: Dynamically adjust step lengths to speed up
convergence
Preconditioning: Helps with ill-conditioned problems -> pathological
curvature
Damping: A strategy for making ill-posed problems regular -> helps
make local methods (Newton) work globally
Trust region: Determine step length - or radius of trust - first and then
look for good/best descent directions
Relaxation: Relax constraints for better tractability
Combine simple and complex methods: Levenberg-Marquardt
algorithm, combines Gradient Descent and Newton’s method (ensures
global convergence plus fast local convergence)
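The Levenberg-Marquardt combination mentioned above can be sketched in a few lines for a least-squares objective f(x) = 0.5 ||r(x)||^2. This is only an illustration, with a made-up exponential fitting problem (a = 2, b = 0.5 chosen for the synthetic data); the damping parameter lam interpolates between a Gauss-Newton step (lam small) and a short gradient step (lam large):

```python
import numpy as np

def levenberg_marquardt(r, J, x0, lam=1.0, iters=50):
    """LM sketch: blend Gauss-Newton and gradient descent via damping lam."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        Jx, rx = J(x), r(x)
        # Solve (J^T J + lam*I) d = -J^T r:
        # lam -> 0 gives Gauss-Newton, lam -> inf gives a short gradient step
        d = np.linalg.solve(Jx.T @ Jx + lam * np.eye(len(x)), -Jx.T @ rx)
        if 0.5 * np.sum(r(x + d) ** 2) < 0.5 * np.sum(rx ** 2):
            x, lam = x + d, lam * 0.5   # success: trust the quadratic model more
        else:
            lam *= 2.0                  # failure: fall back towards gradient descent
    return x

# Synthetic data: fit y = a*exp(b*t) with true parameters a=2, b=0.5
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(0.5 * t)
r = lambda x: x[0] * np.exp(x[1] * t) - y          # residual vector
J = lambda x: np.column_stack([np.exp(x[1] * t),   # dr/da
                               x[0] * t * np.exp(x[1] * t)])  # dr/db
params = levenberg_marquardt(r, J, [1.0, 0.0])
```

The accept/reject step is what gives the global convergence behaviour, while the Gauss-Newton limit provides the fast local convergence.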
19. Gradient Descent
Minimize a given objective function f :
min f(x), x ∈ R^n
Direction of Steepest Descent, the negative gradient:
d = −∇f(x)
Update in step k
x_{k+1} = x_k − α∇f(x_k)
Properties:
always a descent direction, no test needed
locally optimal, globally convergent
works with inexact line search, e.g. Armijo’s rule
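The update rule and Armijo's rule from this slide can be sketched directly in numpy (a minimal illustration on a made-up ill-conditioned quadratic, not a production optimizer):

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha0=1.0, beta=0.5, c=1e-4,
                     tol=1e-8, max_iter=1000):
    """Gradient descent with Armijo backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # (near-)critical point reached
            break
        alpha = alpha0
        # Armijo's rule: shrink the step until sufficient decrease holds
        while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
            alpha *= beta
        x = x - alpha * g                # x_{k+1} = x_k - alpha * grad f(x_k)
    return x

# Ill-conditioned quadratic: f(x) = 0.5*(x1^2 + 10*x2^2)
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad = lambda x: np.array([x[0], 10.0 * x[1]])
x_star = gradient_descent(f, grad, [5.0, 5.0])
```

Since the negative gradient is always a descent direction, the backtracking loop is guaranteed to terminate; no extra descent test is needed.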
20. Stochastic Gradient Descent
Setting
x model parameters
f(x) := Σ_i f_i(x), loss function is sum of individual losses
∇f(x) := Σ_i ∇f_i(x), i = 1, . . . , m number of training examples
Choose i and update in step k
x_{k+1} = x_k − α∇f_i(x_k)
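The SGD update above can be sketched on a synthetic least-squares problem, where each f_i(x) = 0.5*(a_i^T x - b_i)^2 is the loss of a single training example (problem size, stepsize and epoch count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic consistent least-squares problem: b = A @ x_true exactly
m, n = 200, 5
A = rng.normal(size=(m, n))
x_true = rng.normal(size=n)
b = A @ x_true

def sgd(A, b, alpha=0.01, epochs=100):
    x = np.zeros(A.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(b)):       # visit examples in random order
            g_i = (A[i] @ x - b[i]) * A[i]      # gradient of f_i only
            x -= alpha * g_i                    # x_{k+1} = x_k - alpha * grad f_i(x_k)
    return x

x_hat = sgd(A, b)
```

Each step touches one training example instead of the full sum, which is what makes the cost per step independent of m.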
21. Shortcomings of Gradient Descent
local: only local information used
especially: no curvature information used
greedy: prefers high curvature directions
scale invariant: no
James Martens, Deep learning via Hessian-free optimization
22. Momentum
Update in step k
z_{k+1} = βz_k + ∇f(x_k)
x_{k+1} = x_k − αz_{k+1}
Properties for a quadratic convex objective:
effective condition number improves from κ to √κ
stepsizes can be twice as long
order of convergence (√κ − 1)/(√κ + 1) instead of (κ − 1)/(κ + 1)
can diverge if β is not properly chosen/adapted
Gabriel Goh, Why momentum really works
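The two-line momentum update above translates directly to code. A minimal sketch on a made-up badly conditioned quadratic (α, β and the iteration count are illustrative choices, not tuned recommendations):

```python
import numpy as np

def momentum_descent(grad, x0, alpha=0.01, beta=0.9, iters=500):
    """Heavy-ball momentum: z accumulates an exponential average of gradients."""
    x = np.asarray(x0, dtype=float)
    z = np.zeros_like(x)
    for _ in range(iters):
        z = beta * z + grad(x)      # z_{k+1} = beta * z_k + grad f(x_k)
        x = x - alpha * z           # x_{k+1} = x_k - alpha * z_{k+1}
    return x

# Convex quadratic with condition number kappa = 100
grad = lambda x: np.array([1.0 * x[0], 100.0 * x[1]])
x_min = momentum_descent(grad, [1.0, 1.0])
```

Setting beta = 0 recovers plain gradient descent; increasing beta towards 1 speeds up progress along low-curvature directions but, as the slide notes, can diverge if chosen poorly.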
23. Momentum
D E M O
https://distill.pub/2017/momentum/
24. Adam
Properties:
combines several clever tricks (from Momentum, RMSprop, AdaGrad)
has some similarities to Trust Region methods
empirically proven - best in class (personal opinion)
Kingma, Ba Adam: A method for stochastic optimization
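A minimal sketch of the Adam update, showing how the Momentum-style first moment and the RMSprop-style second moment combine (the test problem and the decaying stepsize schedule are illustrative; the decay follows the style of the convergence analysis in Kingma/Ba rather than common deep-learning practice):

```python
import numpy as np

def adam(grad, x0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, iters=5000):
    """Adam sketch: biased moment estimates plus bias correction."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)                         # 1st moment (like Momentum)
    v = np.zeros_like(x)                         # 2nd moment (like RMSprop/AdaGrad)
    for t in range(1, iters + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        step = alpha / np.sqrt(t)                # decaying stepsize for convergence
        x = x - step * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Badly scaled quadratic: gradient components differ by a factor of 100
grad = lambda x: np.array([x[0], 100.0 * x[1]])
x_min = adam(grad, [1.0, 1.0])
```

The division by sqrt(v_hat) rescales each coordinate individually, which is why Adam copes with badly scaled problems that slow plain gradient descent down.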
25. SGD, Momentum and more
D E M O
Visualization of algorithms - by Sebastian Ruder
26. Beyond Adam
Adam has problems (and it’s not Eve)
Parameters are coupled
Some results indicate that Adam does not have the best generalization
properties
It's a heuristic -> convergence guarantee?
And Adam has friends!
New variants that decouple parameters
Combine Adam (better at early training stages) and SGD (better
generalization properties)
This also helps with convergence!
Wilson et al The Marginal Value of Adaptive Gradient Methods in Machine Learning
Keskar, Socher Improving Generalization Performance by Switching from Adam to SGD
27. Higher-Order Methods
Second-Order Methods
Require existence of Hessian and use this for scaling gradients accordingly,
very successful but computationally expensive
Classical Newton Method: fast local convergence, no global
convergence
Relaxed Newton Methods: help with global convergence
Damped Newton Methods: help with global convergence
Modified Newton Methods: help with computational complexity
Quasi-Newton Methods: help with computational complexity
Nonlinear Conjugate Gradient Methods: iteratively build approximation
to Hessian
But there is a lot more to explore, e.g. the basin-hopping algorithm - a strategy for finding global optima - or
derivative-free methods like Nelder-Mead (downhill simplex), Particle Swarm Optimization (PSO and its variants)
28. L-BFGS and Nonlinear CG
Observations so far:
The better the method, the more parameters to tune.
All better methods try to incorporate curvature information.
Why not do so directly?
L-BFGS
Quasi-Newton method, builds an approximation of the (inverse) Hessian
and scales gradient accordingly.
Nonlinear CG
Informally speaking, Nonlinear CG tries to solve a quadratic approximation
of the function.
No surprise: They also work with minibatches.
29. Empirical results
Empirical evidence for better optimizers being better learners.
MNIST, handwritten digit recognition, from Ng et al., On Optimization Methods for Deep Learning
30. Truncated Newton: Hessian-Free Optimization
Main ideas:
Approximate not Hessian H, but matrix-vector product Hd.
Use finite differences instead of exact Hessian.
Use damping.
Use Linear CG method for solving quadratic approximation.
Use a clever mini-batch strategy for large datasets.
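The first two ideas above can be combined in a short sketch: approximate the Hessian-vector product Hd by finite differences of the gradient, and feed it to linear CG to solve the Newton system H d = -∇f without ever forming H. The quadratic test problem below is made up so that the result is easy to check (for a quadratic the finite-difference product is exact up to rounding):

```python
import numpy as np

def hvp_fd(grad, x, d, eps=1e-6):
    """Hessian-vector product via finite differences of the gradient:
    H d ≈ (grad(x + eps*d) - grad(x)) / eps, no explicit Hessian needed."""
    return (grad(x + eps * d) - grad(x)) / eps

def cg(matvec, b, iters=50, tol=1e-10):
    """Linear conjugate gradient: solve H x = b using only H-vector products."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = matvec(p)
        a = rs / (p @ Hp)
        x += a * p
        r -= a * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Quadratic test: f(x) = 0.5 x^T A x - b^T x, so grad f = A x - b and H = A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
x0 = np.zeros(2)
# Newton step from x0: solve H d = -grad(x0), here d lands on the minimizer
step = cg(lambda d: hvp_fd(grad, x0, d), -grad(x0))
```

Since each CG iteration needs only one extra gradient evaluation, the per-step cost stays comparable to first-order methods while curvature information is still exploited.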
31. Empirical test on pathological problems
Main results:
The addition problem is known to be effectively impossible for
gradient descent; HF solved it.
Basic RNN cells are used, no specialized architectures (LSTMs etc.).
(Martens/Sutskever (2011), Hochreiter/Schmidhuber (1997))
32. 1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
33. Summary
In the long run, the biggest bottleneck will be the sequential parts of an
algorithm. That's why the number of iterations needs to be small. SGD and
its successors tend to need many more iterations, and they cannot benefit
as much from higher parallelism (GPUs).
But whatever you do/prefer/choose:
At least try out successors of SGD: Momentum, Adam etc.
Look for generic approaches instead of more and more specialized and
manually finetuned solutions.
Key aspects:
Initialization
Adaptive choice of stepsizes/momentum/. . .
Scaling of the gradient
34. Resources
Overview of Gradient Descent methods
Why momentum really works
Adam - A Method for Stochastic Optimization
Mathematics of Deep Learning
The Marginal Value of Adaptive Gradient Methods in Machine
Learning
Andrew Ng et al. about L-BFGS and CG outperforming SGD
Lecture Slides Neural Networks for Machine Learning - Hinton et al.
On the importance of initialization and momentum in deep learning
Data-Science-Blog: Summary article in preparation (Stefan Kühn)
The Neural Network Zoo