This document summarizes key points from a lecture on training neural networks. It discusses activation functions like ReLU, Leaky ReLU, and ELU and their benefits over sigmoid and tanh; data preprocessing techniques like subtracting the mean, dividing by the standard deviation, and occasionally whitening; and weight initialization techniques (Xavier, Kaiming) that keep gradients flowing properly during training. Mini-batch stochastic gradient descent is introduced as the optimization algorithm, involving sampling batches, computing the loss, backpropagating gradients, and updating parameters. Evaluation techniques like model ensembles, test-time augmentation, and transfer learning are also briefly mentioned.
PQS_7_ruohan.pdf
Lecture 7: Training Neural Networks
Fei-Fei Li, Jiajun Wu, Ruohan Gao. April 19, 2022.
Administrative: A2
A2 is out, due Monday May 2nd, 11:59pm.
Where we are now...
Neural Networks. Linear score function, extended to a 2-layer neural network: an input x (3072 dimensions) is mapped by weights W1 to a hidden layer h (100 units), and by weights W2 to scores s (10 classes).

Where we are now...
Convolutional Neural Networks
Where we are now...
Convolutional Layer: a 5x5x3 filter convolves (slides) over all spatial locations of a 32x32x3 image, producing a 28x28x1 activation map.
For example, if we had 6 5x5 filters, we'd get 6 separate activation maps; we stack these up to get a "new image" of size 28x28x6!
Where we are now...
CNN Architectures

Where we are now...
Learning network parameters through optimization
Where we are now...
Mini-batch SGD
Loop:
1. Sample a batch of data
2. Forward prop it through the graph (network), get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
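A minimal NumPy sketch of this loop on a toy linear regression problem (the data, model, batch size, and learning rate are all illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))             # toy dataset
y = X @ rng.normal(size=10)                 # toy targets
W = 0.01 * rng.normal(size=10)              # parameters

for step in range(100):
    idx = rng.integers(0, len(X), size=32)  # 1. sample a batch of data
    xb, yb = X[idx], y[idx]
    pred = xb @ W                           # 2. forward prop, get loss
    loss = np.mean((pred - yb) ** 2)
    grad = 2 * xb.T @ (pred - yb) / len(xb) # 3. backprop to calculate the gradient
    W -= 1e-2 * grad                        # 4. update the parameters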
Today: Training Neural Networks

Overview
1. One time set up: activation functions, preprocessing, weight initialization, regularization, gradient checking
2. Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization
3. Evaluation: model ensembles, test-time augmentation, transfer learning
Activation Functions: Sigmoid
- Squashes numbers to range [0,1]
- Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron

Three problems:

1. Saturated neurons "kill" the gradients. Consider a sigmoid gate receiving an input x. What happens when x = -10? The local gradient is nearly zero. When x = 0? The gradient is reasonable. When x = 10? Nearly zero again. Why is this a problem? If the local gradient is (nearly) zero, all the gradients flowing back will be zero and the weights will never change.

2. Sigmoid outputs are not zero-centered. Consider what happens when the input to a neuron is always positive: what can we say about the gradients on w? We know the local gradient of the sigmoid is always positive, and we are assuming x is always positive, so the sign of the gradient for every wi is the same as the sign of the upstream scalar gradient. The allowed gradient update directions are therefore always all positive or all negative :( which forces a zig-zag path toward the hypothetical optimal w vector. (This holds for a single element; minibatches help.)

3. exp() is a bit compute expensive.
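A quick numeric check of the saturation problem (not from the slides): the sigmoid's local gradient sigma(x)*(1 - sigma(x)) vanishes for large |x|.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-10.0, 0.0, 10.0):
    s = sigmoid(x)
    print(x, s, s * (1.0 - s))   # gradient: ~4.5e-05, 0.25, ~4.5e-05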
Activation Functions: tanh(x) [LeCun et al., 1991]
- Squashes numbers to range [-1,1]
- Zero centered (nice)
- Still kills gradients when saturated :(
Activation Functions: ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012]
- Computes f(x) = max(0,x)
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Not zero-centered output
- An annoyance: what is the gradient when x < 0? For a ReLU gate receiving an input x: when x = -10 the gradient is zero; when x = 0 it is undefined; when x = 10 the gradient passes straight through. A ReLU that falls outside the data cloud is a "dead ReLU": it will never activate and therefore never update, while an active ReLU fires for at least part of the data cloud. For this reason, people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01).
Activation Functions: Leaky ReLU [Maas et al., 2013]
f(x) = max(0.01x, x)
- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
- Will not "die".

Parametric Rectifier (PReLU) [He et al., 2015]: f(x) = max(ax, x), where the slope a in the negative region is learned.
Activation Functions: Exponential Linear Units (ELU) [Clevert et al., 2015]
- All benefits of ReLU
- Closer to zero mean outputs
- Negative saturation regime (compared with Leaky ReLU) adds some robustness to noise
- Computation requires exp()
(alpha default = 1)

Activation Functions: Scaled Exponential Linear Units (SELU) [Klambauer et al., 2017]
- Scaled version of ELU that works better for deep networks
- "Self-normalizing" property: can train deep SELU networks without BatchNorm
Maxout "Neuron" [Goodfellow et al., 2013]
- Does not have the basic form of dot product -> nonlinearity; instead it takes max(w1^T x + b1, w2^T x + b2)
- Generalizes ReLU and Leaky ReLU
- Linear regime! Does not saturate! Does not die!
- Problem: doubles the number of parameters per neuron :(

TLDR: In practice:
- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU / SELU to squeeze out some marginal gains
- Don't use sigmoid or tanh
Data Preprocessing
(Assume X [NxD] is the data matrix, each example in a row.)

Remember: when the input to a neuron is always positive, the gradient updates on w are always all positive or all negative, forcing a zig-zag path toward the optimal w. This is also why you want zero-mean data!

In practice, you may also see PCA (so the data has a diagonal covariance matrix) and whitening (so the covariance matrix is the identity matrix).

Before normalization: the classification loss is very sensitive to changes in the weight matrix; hard to optimize. After normalization: less sensitive to small changes in weights; easier to optimize.

TLDR: In practice for images: center only. E.g. consider the CIFAR-10 example with [32,32,3] images:
- Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)
- Subtract per-channel mean and divide by per-channel std (e.g. ResNet) (mean and std along each channel = 3 numbers each)
It is not common to do PCA or whitening.
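A sketch of the three options in NumPy, assuming a CIFAR-10-style batch X of shape [N, 32, 32, 3] (random stand-in data here):

import numpy as np

X = np.random.rand(1000, 32, 32, 3).astype(np.float32)   # stand-in images

mean_image = X.mean(axis=0)                  # [32,32,3] array (AlexNet-style)
X_centered = X - mean_image

channel_mean = X.mean(axis=(0, 1, 2))        # 3 numbers (VGGNet-style)
X_centered_pc = X - channel_mean

channel_std = X.std(axis=(0, 1, 2))          # 3 numbers (ResNet-style)
X_normalized = (X - channel_mean) / channel_std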
Weight Initialization
Q: What happens when W = constant init is used? Every neuron computes the same output and receives the same gradient, so all neurons stay identical: there is no symmetry breaking.

First idea: small random numbers (gaussian with zero mean and 1e-2 standard deviation). Works ~okay for small networks, but problems with deeper networks.
Weight Initialization: Activation statistics
Forward pass for a 6-layer net with hidden size 4096. What will happen to the activations for the last layer? All activations tend to zero for deeper network layers. Q: What do the gradients dL/dW look like? A: All zero, no learning =(

Now increase the std of the initial weights from 0.01 to 0.05. What will happen to the activations for the last layer? All activations saturate. Q: What do the gradients look like? A: Local gradients all zero, no learning =(
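A small sketch reproducing this experiment (a 6-layer tanh net with hidden size 4096; the batch size and tanh nonlinearity are assumptions consistent with the slides):

import numpy as np

rng = np.random.default_rng(0)
D = 4096
x = rng.normal(size=(16, D))            # a batch of inputs
for layer in range(6):
    W = 0.01 * rng.normal(size=(D, D))  # try std 0.05 to see saturation instead
    x = np.tanh(x @ W)
    print(layer, x.std())               # stds shrink toward zero layer by layer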
Weight Initialization: "Xavier" Initialization
Glorot and Bengio, "Understanding the difficulty of training deep feedforward neural networks", AISTATS 2010

"Xavier" initialization: std = 1/sqrt(Din)

"Just right": activations are nicely scaled for all layers! For conv layers, Din is filter_size^2 * input_channels.

Derivation. Let y = x1*w1 + x2*w2 + ... + xDin*wDin.
Assume: Var(x1) = Var(x2) = ... = Var(xDin).
We want: Var(y) = Var(xi).

Var(y) = Var(x1*w1 + x2*w2 + ... + xDin*wDin)   [substituting the value of y]
       = Din * Var(xi*wi)                       [assume all xi, wi are iid]
       = Din * Var(xi) * Var(wi)                [assume all xi, wi are zero mean]

So Var(y) = Var(xi) only when Var(wi) = 1/Din.
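Repeating the activation-statistics sketch above with Xavier scaling keeps the statistics roughly constant across layers:

import numpy as np

rng = np.random.default_rng(0)
D = 4096
x = rng.normal(size=(16, D))
for layer in range(6):
    W = rng.normal(size=(D, D)) / np.sqrt(D)   # Var(w) = 1/Din
    x = np.tanh(x @ W)
    print(layer, x.std())                      # roughly constant, not collapsing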
Weight Initialization: What about ReLU?
Change from tanh to ReLU: Xavier assumes a zero-centered activation function, so the activations collapse to zero again, no learning =(

Weight Initialization: Kaiming / MSRA Initialization
He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", ICCV 2015
ReLU correction: std = sqrt(2 / Din). "Just right": activations are nicely scaled for all layers!
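And the same sketch for a ReLU net with the Kaiming/MSRA correction:

import numpy as np

rng = np.random.default_rng(0)
D = 4096
x = rng.normal(size=(16, D))
for layer in range(6):
    W = rng.normal(size=(D, D)) * np.sqrt(2.0 / D)   # ReLU correction
    x = np.maximum(0.0, x @ W)                       # ReLU
    print(layer, x.std())                            # stays well-scaled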
Proper initialization is an active area of research:
- Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al., 2013
- Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al., 2015
- Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al., 2015
- All you need is a good init, Mishkin and Matas, 2015
- Fixup Initialization: Residual Learning Without Normalization, Zhang et al., 2019
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019
Training vs. Testing Error

Beyond Training Error: better optimization algorithms help reduce training loss, but we really care about error on new data. How do we reduce the gap?
79. Early Stopping: Always do this
[Plots: loss vs. iteration, and train/val accuracy vs. iteration, with the stopping point marked where val accuracy peaks]
Stop training the model when accuracy on the validation set decreases.
Or train for a long time, but always keep track of the model snapshot that worked best on val.
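A schematic sketch of the snapshot-keeping variant (train_one_epoch, evaluate, model, and the data loaders are hypothetical stand-ins; state_dict is PyTorch-style):

```python
import copy

num_epochs = 50  # illustrative
best_val_acc, best_state = 0.0, None
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)       # hypothetical helper
    val_acc = evaluate(model, val_loader)      # hypothetical helper
    if val_acc > best_val_acc:                 # new best on val?
        best_val_acc = val_acc
        best_state = copy.deepcopy(model.state_dict())  # keep the snapshot
model.load_state_dict(best_state)  # deploy the best-on-val weights, not the last epoch
```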
80. Model Ensembles
1. Train multiple independent models
2. At test time average their results
(Take average of predicted probability distributions, then choose argmax)
Enjoy 2% extra performance
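A minimal NumPy sketch of step 2 (the per-model probabilities here are randomly generated stand-ins for real model outputs):

```python
import numpy as np

# Three models' predicted probability distributions over 10 classes for 100 examples.
probs_per_model = [np.random.dirichlet(np.ones(10), size=100) for _ in range(3)]
avg_probs = np.mean(probs_per_model, axis=0)  # average the distributions
preds = np.argmax(avg_probs, axis=1)          # then choose the argmax
```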
81. How to improve single-model performance?
Regularization

82. Regularization: Add term to loss
L = (data loss) + λ R(W)
In common use:
L2 regularization: R(W) = Σ W^2   (also called weight decay)
L1 regularization: R(W) = Σ |W|
Elastic net (L1 + L2): R(W) = Σ (β W^2 + |W|)
83. Regularization: Dropout
In each forward pass, randomly set some neurons to zero.
Probability of dropping is a hyperparameter; 0.5 is common.
Srivastava et al., “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014
84. Regularization: Dropout
Example forward pass with a 3-layer network using dropout (a sketch of the code follows below).
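The slide's code figure did not survive extraction; below is a sketch in its spirit, with illustrative layer sizes:

```python
import numpy as np

# Forward pass of a 3-layer network with dropout (sizes are illustrative).
W1, b1 = np.random.randn(100, 3072) * 0.01, np.zeros(100)
W2, b2 = np.random.randn(100, 100) * 0.01, np.zeros(100)
W3, b3 = np.random.randn(10, 100) * 0.01, np.zeros(10)
p = 0.5  # probability of keeping a unit active; higher = less dropout

def train_step(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 *= U1                             # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 *= U2                             # drop!
    return np.dot(W3, H2) + b3
```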
85. Regularization: Dropout
How can this possibly be a good idea?
Forces the network to have a redundant representation; prevents co-adaptation of features.
[Figure: a “cat score” computed from features such as “has an ear”, “has a tail”, “is furry”, “has claws”, “mischievous look”; dropout randomly knocks out a subset of these features on each forward pass]
86. Regularization: Dropout
How can this possibly be a good idea?
Another interpretation: dropout is training a large ensemble of models (that share parameters). Each binary mask is one model.
An FC layer with 4096 units has 2^4096 ~ 10^1233 possible masks!
Only ~10^82 atoms in the universe...
87-91. Dropout: Test time
Dropout makes our output random! With input x (image), output y (label), and a random dropout mask z, the network computes y = f_W(x, z).
Want to “average out” the randomness at test time:
y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz
But this integral seems hard…

Want to approximate the integral. Consider a single neuron a with inputs x, y and weights w1, w2:
a = w1 x + w2 y
At test time we have: E[a] = w1 x + w2 y
During training (dropping each input with probability 1/2) we have:
E[a] = 1/4 (w1 x + w2 y) + 1/4 (w1 x + 0y) + 1/4 (0x + 0y) + 1/4 (0x + w2 y)
     = 1/2 (w1 x + w2 y)
At test time, multiply by the dropout probability.
92. Dropout: Test time
At test time all neurons are always active
=> We must scale the activations so that for each neuron:
output at test time = expected output at training time
93. Dropout Summary
Drop at train time, scale at test time.
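Continuing the sketch from slide 84, the test-time pass under vanilla dropout scales each layer's activations by the keep probability p:

```python
def predict(X):
    # vanilla dropout at test time: all units active, activations scaled by p
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p
    H2 = np.maximum(0, np.dot(W2, H1) + b2) * p
    return np.dot(W3, H2) + b3
```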
94. More common: “Inverted dropout”
Scale by 1/p at train time instead, so test time is unchanged!
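A sketch of inverted dropout under the same hypothetical setup: the 1/p rescaling moves into the training pass, and predict needs no dropout-specific code:

```python
def train_step(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # mask rescaled by 1/p at train time
    H1 *= U1
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p
    H2 *= U2
    return np.dot(W3, H2) + b3

def predict(X):
    H1 = np.maximum(0, np.dot(W1, X) + b1)  # no scaling needed at test time
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    return np.dot(W3, H2) + b3
```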
95-96. Regularization: A common pattern
Training: Add some kind of randomness
Testing: Average out randomness (sometimes approximate)
Example: Batch Normalization
Training: Normalize using stats from random minibatches
Testing: Use fixed stats to normalize
97-98. Regularization: Data Augmentation
Load image and label (“cat”) → transform the image → CNN → compute loss
This image by Nikita is licensed under CC-BY 2.0
99. Data Augmentation: Horizontal Flips
100-101. Data Augmentation: Random crops and scales
Training: sample random crops / scales. ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch
Testing: average a fixed set of crops. ResNet:
1. Resize image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 224 x 224 crops: 4 corners + center, plus flips
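A minimal sketch of the training-time procedure (PIL-based; one possible implementation, not from the slides):

```python
import numpy as np
from PIL import Image

def random_crop_and_scale(img: Image.Image) -> Image.Image:
    L = np.random.randint(256, 481)  # 1. random short side in [256, 480]
    w, h = img.size
    scale = L / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))  # 2. short side = L
    w, h = img.size
    x = np.random.randint(0, w - 224 + 1)  # 3. random 224 x 224 patch
    y = np.random.randint(0, h - 224 + 1)
    return img.crop((x, y, x + 224, y + 224))
```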
102-103. Data Augmentation: Color Jitter
Simple: Randomize contrast and brightness
More complex (as seen in [Krizhevsky et al. 2012], ResNet, etc.):
1. Apply PCA to all [R, G, B] pixels in the training set
2. Sample a “color offset” along principal component directions
3. Add the offset to all pixels of a training image
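A minimal NumPy sketch of the PCA variant (AlexNet-style; the offset scale sigma is an illustrative hyperparameter):

```python
import numpy as np

def fit_pca(pixels):
    # pixels: (N, 3) array of all [R, G, B] training pixels, values in [0, 1]
    cov = np.cov(pixels - pixels.mean(axis=0), rowvar=False)  # 3x3 color covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvals, eigvecs

def pca_color_jitter(img, eigvals, eigvecs, sigma=0.1):
    # img: (H, W, 3); sample one offset per image along principal directions
    alphas = np.random.normal(0, sigma, size=3)
    offset = eigvecs @ (alphas * eigvals)  # weighted combination of eigenvectors
    return np.clip(img + offset, 0.0, 1.0)
```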
104. Data Augmentation: Get creative for your problem!
Examples of data augmentations:
- translation
- rotation
- stretching
- shearing
- lens distortions, … (go crazy)
105. Automatic Data Augmentation
Cubuk et al., “AutoAugment: Learning Augmentation Strategies from Data”, CVPR 2019
106. Regularization: A common pattern
Training: Add random noise
Testing: Marginalize over the noise
Examples: Dropout, Batch Normalization, Data Augmentation
107. Regularization: DropConnect
Training: Drop connections between neurons (set weights to 0)
Testing: Use all the connections
Wan et al., “Regularization of Neural Networks using DropConnect”, ICML 2013
108. Regularization: Fractional Max Pooling
Training: Use randomized pooling regions
Testing: Average predictions from several regions
Graham, “Fractional Max Pooling”, arXiv 2014
109. Regularization: Stochastic Depth
Training: Skip some layers in the network
Testing: Use all the layers
Huang et al., “Deep Networks with Stochastic Depth”, ECCV 2016
110. Regularization: Cutout
Training: Set random image regions to zero
Testing: Use full image
Works very well for small datasets like CIFAR; less common for large datasets like ImageNet.
DeVries and Taylor, “Improved Regularization of Convolutional Neural Networks with Cutout”, arXiv 2017
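A minimal sketch of Cutout (the square size is an illustrative hyperparameter):

```python
import numpy as np

def cutout(img, size=16):
    # img: (H, W, C) array; zero out a random size x size square
    h, w = img.shape[:2]
    y, x = np.random.randint(0, h), np.random.randint(0, w)
    y0, y1 = max(0, y - size // 2), min(h, y + size // 2)
    x0, x1 = max(0, x - size // 2), min(w, x + size // 2)
    img = img.copy()
    img[y0:y1, x0:x1] = 0
    return img
```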
111. Regularization: Mixup
Training: Train on random blends of images
Testing: Use original images
Randomly blend the pixels of pairs of training images, e.g. 40% cat + 60% dog; the target label becomes cat: 0.4, dog: 0.6.
Zhang et al., “mixup: Beyond Empirical Risk Minimization”, ICLR 2018
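A minimal sketch of mixup on one pair of examples (the paper samples the blend weight from a Beta distribution; alpha here is illustrative):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    # x: image arrays; y: one-hot label vectors
    lam = np.random.beta(alpha, alpha)  # blend weight in [0, 1]
    x = lam * x1 + (1 - lam) * x2       # e.g. 40% cat + 60% dog pixels
    y = lam * y1 + (1 - lam) * y2       # soft target: cat 0.4, dog 0.6
    return x, y
```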
112. Regularization: In practice
Training: Add random noise
Testing: Marginalize over the noise
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout / Random Crop, Mixup
In practice:
- Consider dropout for large fully-connected layers
- Batch normalization and data augmentation are almost always a good idea
- Try cutout and mixup, especially for small classification datasets
113. Choosing Hyperparameters (without tons of GPUs)

114. Step 1: Check initial loss
Turn off weight decay, sanity check the loss at initialization
e.g. log(C) for softmax with C classes
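A quick sketch of the expected value (evaluate_loss is a hypothetical helper):

```python
import numpy as np

C = 10               # number of classes
expected = np.log(C) # ~2.303: softmax loss of a uniform prediction
print(f"expected initial softmax loss: {expected:.3f}")
# initial_loss = evaluate_loss(model, minibatch)  # hypothetical helper
# a healthy random init should land close to `expected`
```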
115. Step 2: Overfit a small sample
Try to train to 100% training accuracy on a small sample of training data (~5-10 minibatches); fiddle with architecture, learning rate, weight initialization.
Loss not going down? LR too low, or bad initialization.
Loss explodes to Inf or NaN? LR too high, or bad initialization.
116. Step 3: Find LR that makes loss go down
Use the architecture from the previous step, use all training data, turn on small weight decay, and find a learning rate that makes the loss drop significantly within ~100 iterations.
Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4
117. Step 4: Coarse grid, train for ~1-5 epochs
Choose a few values of learning rate and weight decay around what worked in Step 3, and train a few models for ~1-5 epochs.
Good weight decay values to try: 1e-4, 1e-5, 0
118. Step 5: Refine grid, train longer
Pick the best models from Step 4 and train them longer (~10-20 epochs) without learning rate decay.
119. Step 6: Look at loss and accuracy curves
120. [Plot: train and val accuracy vs. time, both still rising]
Accuracy still going up: you need to train longer.
121. [Plot: train accuracy vs. time far above val accuracy]
Huge train / val gap means overfitting! Increase regularization, get more data.
122. [Plot: train and val accuracy vs. time nearly identical]
No gap between train / val means underfitting: train longer, use a bigger model.
123. Look at learning curves!
Losses may be noisy; use a scatter plot, and also plot a moving average to see trends better.
[Plots: training loss; train / val accuracy]
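A minimal sketch of the moving-average smoothing (the window size is illustrative):

```python
import numpy as np

def moving_average(losses, window=100):
    # smooth a noisy per-iteration loss curve for plotting
    kernel = np.ones(window) / window
    return np.convolve(losses, kernel, mode="valid")
```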
124. Cross-validation
We develop “command centers” to visualize all our models training with different hyperparameters (check out Weights & Biases).

125. You can plot all your loss curves for different hyperparameters on a single plot.

126. Don't look at accuracy or loss curves for too long!
127. Choosing Hyperparameters (recap)
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss and accuracy curves
Step 7: GOTO step 5
128. Random Search vs. Grid Search
[Figure: grid layout vs. random layout over an important and an unimportant parameter; random search covers more distinct values of the important parameter]
Bergstra and Bengio, “Random Search for Hyper-Parameter Optimization”, 2012
Illustration of Bergstra et al., 2012 by Shayne Longpre, copyright CS231n 2017
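A minimal sketch of random search over two hyperparameters, sampled log-uniformly (the ranges, trial count, and train_and_evaluate helper are illustrative assumptions):

```python
import numpy as np

for trial in range(20):
    lr = 10 ** np.random.uniform(-4, -1)  # learning rate in [1e-4, 1e-1]
    wd = 10 ** np.random.uniform(-5, -3)  # weight decay in [1e-5, 1e-3]
    # val_acc = train_and_evaluate(lr, wd)  # hypothetical training helper
    print(f"trial {trial}: lr={lr:.2e}, wd={wd:.2e}")
```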
129. Summary
- Improve your training error:
  - Optimizers
  - Learning rate schedules
- Improve your test error:
  - Regularization
  - Choosing hyperparameters

130. Summary: TLDRs
We looked in detail at:
- Activation Functions (use ReLU)
- Data Preprocessing (images: subtract mean)
- Weight Initialization (use Xavier/He init)
- Batch Normalization (use this!)
- Transfer Learning (use this if you can!)

131. Next time: Visualizing and Understanding