The document summarizes the author's computer vision research from 2020 to the present. It covers areas of research including image segmentation, 3D reconstruction, image restoration, and lip generation. Specific projects are mentioned under each area, such as YOLACT and MODNet for image segmentation, PIFu and SMPL for 3D reconstruction, and Wav2Lip and SyncTalkFace for lip generation from speech. The author also outlines plans for future research directions involving multimodal learning, generative models, and representing scenes with neural radiance fields.
Presenter: 이활석 (Naver Clova)
Date: November 2017
(Current) NAVER Clova Vision
(Current) TFKR organizer
Overview:
Recent deep learning research has been rapidly shifting its center of gravity from supervised learning toward unsupervised learning.
In computer vision in particular, the research trend is moving from recognition techniques, the supervised task of finding information already present in an image,
toward generation techniques, the unsupervised task of synthesizing images that contain specified information.
This seminar briefly reviews the working principles of the two pillars of generative modeling, the VAE (variational autoencoder) and the GAN (generative adversarial network), and shares results from key related papers.
The lecture is organized so that attendees without prior deep learning knowledge can understand the concepts behind VAE and GAN, the two methodologies for training generative models,
and gauge the current state of the technology.
Super resolution in deep learning era - Jaejun Yoo
1) The document discusses super-resolution techniques in deep learning, including inverse problems, image restoration problems, and different deep learning models.
2) Early models like SRCNN used convolutional networks for super-resolution but were shallow, while later models incorporated residual learning (VDSR), recursive learning (DRCN), and became very deep and dense (SRResNet).
3) Key developments included EDSR which provided a strong backbone model and GAN-based approaches like SRGAN which aimed to generate more realistic textures but require new evaluation metrics.
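The closing point, that SRGAN-style models call for new evaluation metrics, is easier to appreciate with the classic metric in hand. A minimal pure-Python sketch of PSNR (the function name and list-of-lists image format are illustrative choices, not from the talk); GAN-based SR can score lower PSNR than MSE-trained models while looking sharper, which is exactly the mismatch motivating perceptual metrics:

```python
import math

def psnr(reference, reconstruction, max_val=255.0):
    """Peak signal-to-noise ratio between two same-sized grayscale images.

    Higher is better. Computed as 10 * log10(MAX^2 / MSE).
    """
    flat_ref = [p for row in reference for p in row]
    flat_rec = [p for row in reconstruction for p in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_rec)) / len(flat_ref)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Identical images yield infinite PSNR; any distortion lowers it, regardless of whether the distortion looks plausible to a human.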
Anomaly detection using deep one class classifier - 홍배 김
The document discusses anomaly detection techniques using deep one-class classifiers and generative adversarial networks (GANs). It proposes using an autoencoder to extract features from normal images, training a GAN on those features to model the distribution, and using a one-class support vector machine (SVM) to determine if new images are within the normal distribution. The method detects and localizes anomalies by generating a binary mask for abnormal regions. It also discusses Gaussian mixture models and the expectation-maximization algorithm for modeling multiple distributions in data.
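The pipeline the summary describes (extract features from normal data, model their distribution, threshold new samples) can be sketched with a toy stand-in: a per-dimension Gaussian fitted to "normal" feature vectors and a z-score threshold, in place of the trained autoencoder, GAN, and one-class SVM. All names and numbers here are illustrative:

```python
import math

def fit_gaussian(features):
    """Per-dimension mean and std of normal-class feature vectors."""
    dims, n = len(features[0]), len(features)
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    stds = [max(1e-9, math.sqrt(sum((f[d] - means[d]) ** 2 for f in features) / n))
            for d in range(dims)]
    return means, stds

def anomaly_score(x, means, stds):
    """Largest absolute z-score across dimensions; large = anomalous."""
    return max(abs(x[d] - means[d]) / stds[d] for d in range(len(x)))

normal = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.0]]
means, stds = fit_gaussian(normal)
# A sample near the normal cluster scores low; a far-away one scores high.
```

The real method replaces the Gaussian with a learned feature distribution and adds a binary mask for localizing abnormal regions, but the decision logic has the same shape: score, then threshold.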
Deep learning-for-pose-estimation-wyang-defense - Wei Yang
This document summarizes a thesis proposal on using deep learning for articulated human pose estimation. The proposed method uses a deep convolutional neural network (DCNN) as a front-end to extract local appearance features of body parts, combined with message passing layers to model spatial relationships between parts through pairwise constraints. This global pose model is trained end-to-end using a max-sum algorithm to maximize consistency across the entire human pose. Experimental results on standard pose estimation datasets demonstrate state-of-the-art performance.
Generating super resolution images using transformers - NEERAJ BAGHEL
The document summarizes a research paper on the transformer architecture, originally developed for natural language processing. Some key points:
- Transformers use attention mechanisms to draw global dependencies between input and output without regard to sequence length, addressing limitations of RNNs and CNNs for NLP tasks.
- The proposed transformer architecture contains self-attention layers in the encoder and decoder, as well as an attention mechanism between the encoder and decoder.
- The transformer uses scaled dot-product attention and multi-head attention. Self-attention allows relating different positions of a single sequence to compute representations.
- Other components include feedforward layers and positional encoding to inject information about the relative or absolute positions of the tokens in the sequence
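The scaled dot-product attention described in the bullets can be written out directly. A pure-Python toy with no batching or multi-head splitting (names are illustrative):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, on plain lists."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Because every query attends to every key in one step, dependencies between distant positions cost the same as adjacent ones, which is the advantage over RNNs the bullets mention.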
Neural Inverse Rendering for General Reflectance Photometric Stereo (ICML 2018) - Tatsunori Taniai
Our physics-embedded neural network approach to photometric stereo:
1) Uses an autoencoder with two streams - one to estimate surface normals from images and another to reconstruct images using the estimated normals.
2) Implements an unsupervised learning approach using a reconstruction loss function without needing ground truth surface normal data.
3) Incorporates a weak supervision prior in early training to stabilize learning, which is removed later.
4) Outperforms other methods on real-world scenes, achieving state-of-the-art results for general reflectance photometric stereo.
This document describes six MATLAB projects available from VenSoft Technologies for the 2014-2015 academic year. It provides the titles, abstracts, publishing details, and index terms for each project. The projects involve topics such as sparse unmixing of hyperspectral data, mixed noise removal, subspace matching pursuit, exploiting spectral a priori information, gradient histogram preservation for image denoising, and image set-based collaborative representation for face recognition.
This document proposes using a deep belief network (DBN) to learn depth perception from optical flow information. It describes:
1) Using motion parallax and optical flow cues to perceive depth in humans and insects.
2) Generating labeled training data from 3D graphics scenes to teach the DBN the mapping from motion to depth.
3) The DBN architecture, which takes motion energy maps as input and uses multiple hidden layers and backpropagation to predict depth maps.
4) Test results showing the DBN achieves a higher R^2 score for depth prediction than other models like linear regression.
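The evaluation criterion in point 4 is easy to state exactly. A minimal sketch of the R^2 (coefficient of determination) score used to compare the DBN against linear regression:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    1.0 means perfect prediction; 0.0 means no better than
    always predicting the mean of y_true.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A model that only predicts the average depth scores 0, so a higher R^2 for the DBN means it genuinely captures structure in the motion-to-depth mapping.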
Image Interpolation Techniques with Optical and Digital Zoom Concepts - mmjalbiaty
Digital image concepts and interpolation techniques for optical and digital zoom are discussed. There are three main types of interpolation used for resizing images: nearest neighbor, bilinear, and bicubic. Nearest neighbor is the simplest but produces the lowest quality, while bicubic is the most complex but highest quality. Optical zoom uses lens magnification before sensing, whereas digital zoom interpolates after sensing, resulting in lower quality than optical zoom. Interpolation methods assign pixel values to new locations during resizing based on weighting patterns around the original pixel values.
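The quality/complexity trade-off described above can be sketched for the middle option. A pure-Python toy of bilinear resizing for a grayscale image stored as a list of lists (illustrative, not taken from the document):

```python
def bilinear_resize(img, new_h, new_w):
    """Resize a grayscale image with bilinear interpolation.

    Each output pixel is a distance-weighted average of the four nearest
    source pixels, trading nearest-neighbor blockiness for smoothness.
    """
    h, w = len(img), len(img[0])
    out = []
    for i in range(new_h):
        # Map output row back into source coordinates.
        y = i * (h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = min(int(y), h - 2) if h > 1 else 0
        fy = y - y0
        row = []
        for j in range(new_w):
            x = j * (w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = min(int(x), w - 2) if w > 1 else 0
            fx = x - x0
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            # Blend horizontally on the two surrounding rows, then vertically.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out
```

This is what digital zoom does after sensing: it invents in-between values from the captured pixels, which is why it cannot match optical zoom's true extra detail.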
The document provides biographical and research information about Amir Parnianifard. It summarizes his educational background, current affiliations with Glasgow College and the University of Electronic Science and Technology of China, and contact information. It then lists his main research interests as engineering design optimization, surrogate modeling, uncertainty quantification, and other topics in computational intelligence and optimal control. The document provides an overview of differential evolution algorithms and includes MATLAB code examples for implementing differential evolution optimization and other related modeling techniques.
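The differential evolution algorithm the document overviews can be sketched in a few lines. A minimal DE/rand/1/bin loop in Python rather than the document's MATLAB (all parameter defaults and names are illustrative):

```python
import random

def differential_evolution(f, bounds, pop_size=20, F=0.8, CR=0.9,
                           generations=100, seed=0):
    """Minimize f over box bounds: mutate with a scaled difference vector,
    binomially cross over, and keep the better of trial vs. target."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    costs = [f(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct partners, none equal to the target index.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < CR or j == j_rand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    trial.append(min(max(v, lo), hi))  # clip to bounds
                else:
                    trial.append(pop[i][j])
            cost = f(trial)
            if cost <= costs[i]:  # greedy selection
                pop[i], costs[i] = trial, cost
    best = min(range(pop_size), key=costs.__getitem__)
    return pop[best], costs[best]

best, cost = differential_evolution(lambda x: sum(v * v for v in x),
                                    [(-5.0, 5.0)] * 2)
```

On the 2-D sphere function this converges near the origin; the surrogate-modeling work mentioned above typically wraps such an optimizer around an expensive simulation.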
The document describes a simple approach for text-to-image generation using a transformer that models text and image tokens as a single stream. It involves training the transformer in two stages: (1) Pretraining a VQ-VAE to encode images into discrete tokens, and (2) Training the transformer to autoregressively model the joint distribution of image tokens and BPE-encoded text tokens. With sufficient data and scale, this approach is competitive with previous domain-specific models for text-to-image generation.
Paper Introduction "Density-aware person detection and tracking in crowds" - 壮 八幡
This document summarizes a paper on detecting and tracking people in crowded scenes. It proposes an energy formulation approach that leverages global scene structure and resolves all detections jointly. The approach formulates detection as an energy minimization problem involving terms for person detector confidence scores, non-overlapping detections, and crowd density estimation. It estimates crowd density using a Gaussian mixture model and learns model parameters by minimizing a mean squared error distance between annotated and estimated density maps.
This document summarizes a doctoral thesis presentation on statistical learning theory for parameter-restricted singular models. It discusses how singular models like hierarchical and latent variable models are important in statistical model design but traditional learning theory cannot analyze the generalization error of such singular models. The presentation analyzes the generalization error of non-negative matrix factorization (NMF) and latent Dirichlet allocation (LDA) as examples of parameter-restricted singular models. It derives an upper bound for the real log-threshold of NMF which determines the generalization error of singular models, and precisely analyzes the real log-threshold of LDA by relating it to a constrained matrix factorization.
1. The document proposes using Bayesian inverse reinforcement learning (IRL) with neural networks for anomaly prediction detection. It formulates the problem as a Markov decision process to learn the reward function from expert trajectories.
2. A Bayesian neural network is used to model the reward function, with weights assigned prior distributions. The model is trained by maximizing the log likelihood of the training data to find the posterior distribution over weights.
3. The approach is evaluated on temperature anomaly detection and maze navigation tasks. Bayesian IRL is able to distinguish normal trajectories from anomalous ones in test data for intentional anomaly detection.
Towards Accurate Multi-person Pose Estimation in the Wild (My summary) - Abdulrahman Kerim
This presentation summarizes a paper on multi-person pose estimation using a two-stage deep learning model. The approach uses a Faster R-CNN model to detect person boxes, then applies a separate ResNet model to each box to predict keypoints. It trains on the COCO dataset and evaluates on COCO test images, achieving state-of-the-art accuracy for multi-person pose estimation. Key aspects covered include the motivation, problem definition, approach using heatmap and offset predictions, model training procedure, evaluation metrics and results.
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen... - Shuhei Yoshida
Unsupervised learning of disentangled representations was the goal. The approach was to use GANs and maximize the mutual information between generated images and input latent codes. This yields interpretable representations without supervision, at negligible additional computational cost over a standard GAN.
1) The document proposes TransNeRF, a transfer learning framework for neural radiance fields (NeRF) that improves scene reconstruction efficiency.
2) TransNeRF uses two MLPs - one for 3D scene generation and another for color emission. It also uses generative latent optimization to account for photometric variations.
3) TransNeRF is trained from lower to higher resolution images. The first MLP predicts geometry, while the second MLP's weights are transferred between resolutions, allowing geometry to remain stable while radiance varies per image.
HRNET : Deep High-Resolution Representation Learning for Human Pose Estimation - taeseon ryu
Hello from the deep learning paper reading group! Today's paper is titled Deep High-Resolution Representation Learning for Human Pose Estimation.
It addresses pose estimation. Existing pose estimation models use a serial network structure, but a serial structure loses local detail during compression
and makes the whole process depend heavily on upsampling. To overcome these limitations, HRNet departs from the serial design and arranges its subnetworks in parallel.
Soft computing is likely to play a progressively important role in many applications, including image enhancement. The paradigm for soft computing is the human mind. The soft computing critique has been particularly strong with fuzzy logic. Fuzzy logic represents facts as
rules for managing uncertainty. In this paper, the multi-dimensional optimization problem is addressed by discussing optimal thresholding using fuzzy entropy for image enhancement. The technique is compared with bi-level and multi-level thresholding, and optimal
threshold values are obtained for different levels of speckle-noisy and low-contrast images. The fuzzy entropy method produced better results than the bi-level and multi-level thresholding techniques.
The document provides an overview of key concepts in image processing, including definitions of digital images, image formats, data types, processing operations, and mathematical foundations. It defines what an image and digital image are, explains color models and image resolution. It also covers common image file formats, data types, the scope of image processing including low, mid, and high-level operations. Additionally, it introduces basic terms in image topology and the mathematical concepts of image formation, properties of the point spread function, and linear shift-invariance and convolution.
Learning a nonlinear embedding by preserving class neighbourhood structure (final) - WooSung Choi
Salakhutdinov, Ruslan, and Geoffrey E. Hinton. "Learning a nonlinear embedding by preserving class neighbourhood structure." International Conference on Artificial Intelligence and Statistics. 2007.
The document summarizes Yan Xu's upcoming presentation at the Houston Machine Learning Meetup on dimension reduction techniques. Yan will cover linear methods like PCA and nonlinear methods such as ISOMAP, LLE, and t-SNE. She will explain how these methods work, including preserving variance with PCA, using geodesic distances with ISOMAP, and modeling local neighborhoods with LLE and t-SNE. Yan will also demonstrate these methods on a dataset of handwritten digits. The meetup is part of a broader roadmap of machine learning topics that will be covered in future sessions.
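The linear method in Yan's lineup can be sketched compactly. A toy pure-Python computation of the first PCA direction via power iteration on the covariance matrix (illustrative, not from the talk); PCA keeps the direction of maximal variance, which is the property the nonlinear methods ISOMAP, LLE, and t-SNE each relax in a different way:

```python
import math
import random

def top_principal_component(data, iters=200, seed=0):
    """First PCA direction of row-vector data via power iteration."""
    n, d = len(data), len(data[0])
    # Center the data.
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Covariance matrix C = X^T X / n.
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    # Repeated multiplication by C converges to the top eigenvector.
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

For points lying along the diagonal y = x, the recovered direction is (1, 1) normalized, i.e. the axis along which the data actually varies.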
Image denoising with unknown Non-Periodic Noises - SakshiAggarwal85
This experimental study applies image restoration techniques. The literature suggests that prior knowledge of the noise or unwanted signal helps produce promising results, so we instead embed a procedure into our work that is entirely agnostic to the noise model. In this project, we try to reconstruct the original image from the target noisy image using such restoration models. Three known blind deconvolution models are implemented: the Lucy-Richardson algorithm, the Wiener-Hunt algorithm, and Total Variation Regularization. Results are compared and presented.
Linear regression is a machine learning algorithm that models the relationship between independent variables (x) and a continuous dependent variable (y). It finds the best fit linear equation to model the relationship.
Multiple linear regression extends this to model relationships between a continuous dependent variable and two or more independent variables. The model represents the predicted output as a linear combination of the input variables.
Gradient descent is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. It can be used to learn the parameters of a linear regression model by minimizing the cost function, which represents the error between predictions and true values. Feature scaling helps ensure features are on a similar scale, which speeds up convergence.
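The pieces described above fit together in a few lines of code. A toy batch gradient descent for simple linear regression (learning rate and epoch count are illustrative):

```python
def fit_linear_regression(xs, ys, lr=0.1, epochs=500):
    """Fit y ~ w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) w.r.t. w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient (steepest descent).
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit_linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
```

On data generated by y = 2x + 1, the loop recovers w close to 2 and b close to 1; with unscaled features of very different magnitudes the same learning rate could diverge, which is the point of feature scaling.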
Tutorial Equivariance in Imaging ICMS 23.pptx - Julián Tachella
Equivariant deep learning enables unsupervised learning of inverse problems from measurements alone by exploiting signal symmetries. The measurement operator must not be equivariant to the symmetry group in order for the underlying signal set to be uniquely identified. If the signal set has low dimensionality and the symmetry group is large, the number of measurements needed is the same as for supervised signal recovery. This approach generalizes supervised training by allowing learning from unlabeled data.
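The role of equivariance can be illustrated with a toy check in 1D: a blur (cyclic convolution) commutes with cyclic shifts, while a fixed inpainting mask does not, which is exactly the property the tutorial says the measurement operator must lack. Toy operators, illustrative only:

```python
def shift(x, k):
    """Cyclic shift of a 1D signal by k samples."""
    return x[-k:] + x[:-k]

def blur(x):
    """Cyclic 3-tap moving average: shift-equivariant."""
    n = len(x)
    return [(x[i - 1] + x[i] + x[(i + 1) % n]) / 3 for i in range(n)]

def mask(x):
    """Fixed inpainting mask zeroing the first half: NOT shift-equivariant."""
    n = len(x)
    return [0.0] * (n // 2) + x[n // 2:]

signal = [1.0, 2.0, 3.0, 4.0]
# Blurring a shifted signal equals shifting the blurred signal ...
equivariant = blur(shift(signal, 1)) == shift(blur(signal), 1)
# ... but masking does not commute with shifting, so shifted
# measurements of the same scene carry genuinely new information.
not_equivariant = mask(shift(signal, 1)) != shift(mask(signal), 1)
```

It is this failure to commute that lets unsupervised training exploit the symmetry group: different group elements probe different parts of the signal through the operator.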
Behavior study of entropy in a digital image through an iterative algorithm - ijscmcj
Image segmentation is a critical step in computer vision tasks, constituting an essential issue for pattern recognition and visual interpretation. In this paper, we study the behavior of entropy in digital images through an iterative mean shift filtering algorithm. The order of a digital image in gray levels is defined. The behavior of Shannon entropy is analyzed and then compared, taking into account the number of iterations of our algorithm, with the maximum entropy achievable under the same order. The use of equivalence classes is introduced, which allows us to interpret entropy as a hyper-surface in real m-dimensional space. The difference between the maximum entropy of order n and the entropy of the image is used to group the iterations, in order to characterize the performance of the algorithm.
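The quantities under study can be made concrete: the Shannon entropy of a gray-level histogram, and the maximum entropy log2(n) attainable by an image of order n (n occupied gray levels), whose difference the paper tracks across mean shift iterations. A toy sketch with illustrative names:

```python
import math
from collections import Counter

def shannon_entropy(pixels):
    """Shannon entropy (bits) of a flat list of gray levels."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def max_entropy_of_order(n):
    """Maximum achievable entropy for an image using n gray levels."""
    return math.log2(n)

pixels = [0, 0, 1, 1, 2, 2, 3, 3]  # 4 gray levels, uniform histogram
# A uniform histogram attains the maximum: entropy == log2(4) == 2 bits,
# so the gap studied in the paper is zero for this image.
gap = max_entropy_of_order(4) - shannon_entropy(pixels)
```

As mean shift filtering merges gray levels, the histogram concentrates, the entropy drops, and the gap to the maximum of the current order changes, which is the signal used to characterize the algorithm.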
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
4. Image Segmentation
Instance Segmentation (2020.04~2020.08)
• YOLACT: Real-time Instance Segmentation
Image Matting (2020.09~2020.12)
• MODNet: Trimap-Free Portrait Matting in Real Time
• Real-Time High-Resolution Background Matting (BGMv2)
5. 3D Reconstruction
Human Digitization (2020.06~2020.10)
• PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization
• Expressive Body Capture: 3D Hands, Face, and Body from a Single Image (SMPL eXpressive)
• SMPL: A Skinned Multi-Person Linear Model
(Figure: SMPL, SMPL-X, and PIFu results)
6. Image Restoration
Super Resolution (2021.05~2022.11)
• Unsupervised Real-world Image Super Resolution via Domain-distance Aware Training (DASR)
• Designing a Practical Degradation Model for Deep Blind Image Super-Resolution (BSRGAN)
• SwinIR: Image Restoration Using Swin Transformer
• Towards Robust Blind Face Restoration with Codebook Lookup Transformer (CodeFormer)
• Analyzed the strengths and weaknesses of each model through inference
7. Lip Generation
Text to Mesh (2021.10~2022.01)
• FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
• MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement
• Face2Face: Real-time Face Capture and Reenactment of RGB Videos
(Pipeline: text "안녕하세요" ("Hello") → FastSpeech 2 → Phonemes → Visemes → 3D Mesh)
Phonemes: the smallest units of sound
Visemes: the visual counterparts of phonemes (mouth shapes)
8. Lip Generation
Speech to Image (2022.02~2022.05)
• ObamaNet: Photo-realistic lip-sync from text
• Image-to-Image Translation with Conditional Adversarial Nets (Pix2Pix)
(Pipeline: text "안녕하세요" ("Hello") → Char2Wav → LSTM → Pix2Pix; input and output examples)
9. Lip Generation
Speech to Image (2022.06~Now)
• A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild (Wav2Lip)
10. Lip Generation
Speech to Image (2022.06~Now)
• A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild (Wav2Lip)
• SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory
• Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
12. Appendix: DNN Parallelization (2018.02~2019.12)
• Data Parallelization
• Model Parallelization
Aimed to reduce training time, the biggest bottleneck in DNN development, by making use of multiple GPUs
14. Research Plan
• Multimodal learning
• Audio-Image
• Lip Generation
• Follow-up research to Wav2Lip
• Generative model
• Stable Diffusion
• Improve generation quality by changing how the audio and image representations are extracted
• Improve generation quality by replacing the existing GAN approach with other generative techniques such as diffusion models
• Overcome the limitations of 2D images by estimating 3D information to produce outputs closer to reality
• Representing Scenes as Neural Radiance Fields for View Synthesis (NeRF)
• Learning Transferable Visual Models From Natural Language Supervision (CLIP)
• Denoising Diffusion Probabilistic Models (DDPM)
• High-Resolution Image Synthesis with Latent Diffusion Models
15. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng
AI Labs, Image Processing Part
오현우 (Assistant Manager)
2023.01.31
16. 3D Reconstruction vs Volume Rendering
• Reality Capture
https://www.youtube.com/watch?v=9kIPixG8GHA
Volume rendering is a technique that shows 3D sampled data from a 2D perspective view; NeRF belongs to this category.
17. NeRF Model - 1
1) March camera rays through the scene to generate a sampled set of 3D points
2) Use those points and their corresponding 2D viewing directions as input to the neural network to produce an output set of colors and densities
3) Use classical volume rendering techniques to accumulate those colors and densities into a 2D image
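Step 1 above (marching rays and sampling 3D points) can be sketched as follows; the stratified sampling of the depths t_i follows the paper's coarse sampling scheme, but the function and variable names here are illustrative:

```python
import numpy as np

def sample_along_rays(origins, dirs, near, far, n_samples, rng):
    """Stratified sampling: split [near, far] into bins and draw one
    random depth t_i per bin, then place points origin + t_i * dir."""
    edges = np.linspace(near, far, n_samples + 1)
    lo, hi = edges[:-1], edges[1:]                        # bin boundaries
    u = rng.random((origins.shape[0], n_samples))         # uniform in [0, 1)
    t = lo + u * (hi - lo)                                # (R, S) depths
    points = origins[:, None, :] + dirs[:, None, :] * t[..., None]
    return points, t                                      # (R, S, 3), (R, S)

rng = np.random.default_rng(0)
origins = np.zeros((2, 3))                                # two rays from the origin
dirs = np.tile(np.array([0.0, 0.0, 1.0]), (2, 1))         # both looking along +z
points, t = sample_along_rays(origins, dirs, near=1.0, far=5.0,
                              n_samples=8, rng=rng)
```

Because each depth lives in its own bin, the samples along every ray come out sorted from near to far, which the volume-rendering step relies on.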
18. NeRF Model - 2
F_Θ: (X, d) → (c, σ)
X = (x, y, z): the coordinates of the points sampled where the ray passes
d = (θ, φ): the 2D viewing direction
c = (r, g, b): the emitted color
σ: the density; as density grows the object becomes more opaque (things behind it become hard to see), and as it shrinks the object becomes more transparent
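As a rough sketch of the interface only (random untrained weights, not the paper's 8-layer width-256 architecture, and with positional encoding omitted), F_Θ can be mirrored by a small MLP that maps (X, d) to (c, σ); all names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random, untrained weights just to exercise the interface."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, h):
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)           # ReLU on hidden layers
    return h

def f_theta(params, X, d):
    """(X, d) -> (c, sigma): color in [0, 1]^3 and non-negative density."""
    out = mlp(params, np.concatenate([X, d], axis=-1))
    c = 1.0 / (1.0 + np.exp(-out[..., :3]))  # sigmoid -> rgb in [0, 1]
    sigma = np.maximum(out[..., 3], 0.0)     # ReLU -> density >= 0
    return c, sigma

# d is treated as a 3D unit vector here (as in practice), so the input is 6D.
params = init_mlp([6, 64, 64, 4])
c, sigma = f_theta(params, rng.normal(size=(5, 3)), rng.normal(size=(5, 3)))
```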
21. Volume Rendering
Ĉ(r) = Σ_{i=1}^{N} T_i (1 − exp(−σ_i δ_i)) c_i,  where T_i = exp(−Σ_{j=1}^{i−1} σ_j δ_j)
• r denotes the camera ray; the density and RGB of every sample between near and far contribute to the pixel
• The depths t_i are, roughly speaking, drawn by random (stratified) sampling
• Negating the density and exponentiating means that the larger the accumulated density in front of a sample, the smaller that sample's weight becomes
• The colors and densities the model outputs along one ray are merged into a single pixel through this volume rendering step
• The merged pixel RGB is compared against the real image's pixel RGB with an MSE loss, and learning proceeds via backpropagation (coarse model)
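The discrete volume-rendering quadrature described above can be sketched directly (names illustrative; the last interval is padded with a large constant, as in common implementations):

```python
import numpy as np

def volume_render(colors, sigmas, t_vals):
    """Accumulate per-sample (c_i, sigma_i) along one ray into a pixel:
    C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
    with T_i = exp(-sum_{j<i} sigma_j * delta_j)."""
    delta = np.diff(t_vals, append=t_vals[-1] + 1e10)     # sample spacings
    alpha = 1.0 - np.exp(-sigmas * delta)                 # per-sample opacity
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))  # T_i: light surviving to i
    weights = trans * alpha                               # small if dense stuff is in front
    return (weights[:, None] * colors).sum(axis=0)

# An opaque red sample in front of an opaque green one: red wins the pixel.
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
pixel_front = volume_render(colors, np.array([1e9, 1e9]), np.array([0.5, 1.0]))
# Make the front sample fully transparent: the green sample shows through.
pixel_clear = volume_render(colors, np.array([0.0, 1e9]), np.array([0.5, 1.0]))
```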
22. Positional Encoding
γ(p) = (sin(2⁰πp), cos(2⁰πp), …, sin(2^{L−1}πp), cos(2^{L−1}πp))
• Applied to let the network capture high-frequency detail in the data; it can loosely be viewed as a kind of data augmentation of the input
• X = (x, y, z): 3 dimensions → 60 dimensions (L = 10)
• d: the viewing direction as a 3D unit vector, 3 dimensions → 24 dimensions (L = 4)
• The encoded inputs pass through the network to produce c = (r, g, b) and σ
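A minimal sketch of the encoding (the paper interleaves sin/cos per frequency; here they are simply concatenated, which preserves the output dimensionality):

```python
import numpy as np

def positional_encoding(p, L):
    """gamma(p): map each coordinate to sin/cos at L frequencies.
    p: (..., D) -> (..., D * 2L)."""
    freqs = (2.0 ** np.arange(L)) * np.pi          # 2^k * pi, k = 0..L-1
    angles = p[..., None] * freqs                  # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)

X = np.zeros((1, 3))            # a 3D position
d = np.zeros((1, 3))            # the viewing direction as a 3D unit vector
enc_X = positional_encoding(X, L=10)   # 3 dims -> 60 dims
enc_d = positional_encoding(d, L=4)    # 3 dims -> 24 dims
```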
25. NeRF: Drawbacks
1. Slow speed
NeRF is slow both to train and to render
One NeRF model can represent only one object
A single training run (200k~300k iterations) takes roughly 1~2 days
-> DeRF (21CVPR), NeRF++, Plenoxels (22CVPR)
2. NeRF only performs well on static scenes
It produces heavy noise for scenes that contain moving objects
-> D-NeRF (21CVPR), Nerfies (21ICCV), HyperNeRF
26. NeRF: Drawbacks
3. NeRF only performs well on images captured under the same conditions
Even for a static object, brightness, color, and lighting conditions can vary with weather and time of day,
and in the real world such data are the norm unless you shoot in a studio
-> NeRV (21CVPR), NeRD (21CVPR), NeRF in the Wild (21CVPR)
4. NeRF is not a general model
A single NeRF model can only reproduce a single object
-> GIRAFFE (21CVPR), pixel-NeRF (22CVPR) ...
27. NeRF: Drawbacks
5. Requires a training set spanning too many viewpoints
The synthetic training dataset fed into NeRF contains 100 images
Taking 100 photos to learn a single object is inefficient
Research is needed on rendering objects from only a few photos
-> pixel-NeRF (22CVPR), DietNeRF (21ICCV), Instant-NGP
6. NeRF's camera-parameter inputs
NeRF needs the intrinsic and extrinsic camera parameters to locate the camera
That is too much information to expect when ordinary users capture objects with a smartphone camera
To address this, research estimates the pose, or learns the pose itself
-> iNeRF (21IROS), NeRF--, GNeRF (21ICCV), BARF (21ICCV), SCNeRF (21ICCV)
29. Entropy, Relative Entropy and Mutual Information*
*Elements of Information Theory, Thomas M. Cover, Joy A. Thomas
Definition 1: The entropy of a discrete random variable X with p.d.f. p(x) is defined as
H(X) := −Σ_x p(x) log p(x)
Convention: 0 log 0 = 0
Remark 1: The entropy H(X) is a measure of the average uncertainty of the random variable X.
We can also write
H(X) = −Σ_x p(x) log p(x) = Σ_x p(x) log(1/p(x))
Lemma 1: H(X) ≥ 0
Proof: By definition,
H(X) = −Σ_x p(x) log p(x) = Σ_x p(x)(−log p(x)) ≥ 0
since 0 ≤ p(x) ≤ 1 implies −log p(x) ≥ 0.
Definition 2: The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with joint p.d.f. p(x, y) is defined as
H(X, Y) := −Σ_x Σ_y p(x, y) log p(x, y)
Clearly H(X, Y) ≥ 0
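Definitions 1 and 2 can be checked numerically; a small sketch (log base 2, so entropy is in bits), with the convention 0 log 0 = 0 handled by dropping zero-probability terms:

```python
import numpy as np

def entropy(p):
    """H = -sum p log2 p over the (possibly flattened) probability table,
    with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]                                  # drop zeros: 0 log 0 = 0
    return float(-(nz * np.log2(nz)).sum())

h_coin = entropy([0.5, 0.5])                   # fair coin: 1 bit
h_joint = entropy(np.full((2, 2), 0.25))       # two independent fair coins: 2 bits
h_det = entropy([1.0, 0.0])                    # deterministic: 0 bits (and H >= 0)
```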
31. Definition 4: The Kullback-Leibler divergence, or relative entropy, between two probability mass functions p(x) and q(x) is defined as
D(p‖q) := Σ_x p(x) log(p(x)/q(x)) = E_p[log(p(X)/q(X))]
Remark 2: D(p‖q) ≠ D(q‖p)
Fact: D(p‖q) ≥ 0
Proof:
D(p‖q) = Σ_x p(x) log(p(x)/q(x)) = −Σ_x p(x) log(q(x)/p(x))
≥ −log Σ_x p(x)(q(x)/p(x)) = −log 1 = 0
by Jensen's inequality, since −log is a convex function.
Definition 5: The cross-entropy of p.d.f.s p and q is defined as
H_q(p) := −Σ_x p(x) log q(x)
where p(x) is unknown and q(x) is an approximating p.d.f.
Observation:
(1) D(p‖q) := Σ_x p(x) log(p(x)/q(x)) = Σ_x p(x) log p(x) − Σ_x p(x) log q(x) = −H(p) + H_q(p)
(2) 0 ≤ D(p‖q) = H_q(p) − H(p), i.e. H_q(p) ≥ H(p)
(3) Since p(x) is fixed, minimizing D(p‖q) ⇔ minimizing H_q(p)
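Observations (1)-(3) can be verified numerically; a small sketch with illustrative distributions (log base 2 throughout):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

def cross_entropy(p, q):
    """H_q(p) = -sum p log2 q (terms with p = 0 contribute 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(-(p[m] * np.log2(q[m])).sum())

def kl(p, q):
    """D(p || q) = sum p log2 (p / q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float((p[m] * np.log2(p[m] / q[m])).sum())

p = np.array([0.7, 0.2, 0.1])
q = np.array([1/3, 1/3, 1/3])
# D(p||q) >= 0, D(p||q) != D(q||p), and D(p||q) = H_q(p) - H(p).
```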
32. Definition 6: The mutual information I(X; Y) of two r.v.'s X and Y is defined as
I(X; Y) := Σ_x Σ_y p(x, y) log(p(x, y)/(p(x)p(y))) = E_p[log(p(X, Y)/(p(X)p(Y)))]
Observation: I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
Proof:
I(X; Y) = Σ_x Σ_y p(x, y) log(p(x, y)/(p(x)p(y)))
= Σ_x Σ_y p(x, y) log(p(x|y)p(y)/(p(x)p(y)))
= Σ_x Σ_y p(x, y) log(p(x|y)/p(x))
= Σ_x Σ_y p(x, y) log p(x|y) − Σ_x Σ_y p(x, y) log p(x)
= −H(X|Y) + H(X) = −H(Y|X) + H(Y)
Remark 3: I(X; Y) is the reduction in the uncertainty of X due to the information of Y (and of Y due to the information of X).
Proposition: I(X; Y) ≥ 0, with equality ⇔ X and Y are independent.
Proof:
I(X; Y) = Σ_x Σ_y p(x, y) log(p(x, y)/(p(x)p(y))) =: D(p(x, y)‖p(x)p(y)) ≥ 0
Corollary: I(X; Y|Z) ≥ 0
Proof: I(X; Y|Z) := D(p(x, y|z)‖p(x|z)p(y|z)) ≥ 0 because D(p‖q) ≥ 0
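Definition 6 and the Proposition can be checked from a joint probability table; a small sketch using I(X; Y) = D(p(x, y) ‖ p(x)p(y)):

```python
import numpy as np

def mutual_information(pxy):
    """I(X; Y) = D(p(x, y) || p(x) p(y)), computed from a joint table."""
    pxy = np.asarray(pxy, float)
    px = pxy.sum(axis=1, keepdims=True)            # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)            # marginal p(y)
    m = pxy > 0
    return float((pxy[m] * np.log2(pxy[m] / (px @ py)[m])).sum())

indep = np.outer([0.5, 0.5], [0.5, 0.5])   # X, Y independent -> I = 0
copy = np.array([[0.5, 0.0],
                 [0.0, 0.5]])              # Y is a copy of X -> I = H(X) = 1 bit
```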
33. Contrastive Learning
Problem: For a given massive data set X = {x_1, x_2, …, x_T} without labels, how do we learn an encoder f_θ(·) (a representation) to be used for downstream tasks such as classification or clustering?
Idea: For each data point x ∈ X,
(1) Randomly draw a positive sample x⁺ from x
(2) Randomly draw N−1 negative samples {x_j⁻ ; j = 1, 2, …, N−1} from different classes
(3) Choose any type of neural network and learn the encoder f_θ(·)
(4) Choose a score function, for example f_θ(x)ᵀ f_θ(x⁺) and f_θ(x)ᵀ f_θ(x_j⁻)
(5) The loss function given 1 positive sample and N−1 negative samples is
L(θ) = −E_x log[ exp(f_θ(x)ᵀ f_θ(x⁺)) / ( exp(f_θ(x)ᵀ f_θ(x⁺)) + Σ_{j=1}^{N−1} exp(f_θ(x)ᵀ f_θ(x_j⁻)) ) ]
This is the cross-entropy loss for an N-class softmax classifier, i.e. learning to find the positive sample among the N samples.
(6) For a sample {x_l}_{l=1}^{M} ⊂ X := {x_1, x_2, …, x_T} of batch size M, use the empirical loss function
L_M(θ) = −(1/M) Σ_{l=1}^{M} log[ exp(f_θ(x_l)ᵀ f_θ(x_l⁺)) / ( exp(f_θ(x_l)ᵀ f_θ(x_l⁺)) + Σ_{j=1}^{N−1} exp(f_θ(x_l)ᵀ f_θ(x_j⁻)) ) ]
and update θ = argmin_θ L_M(θ) by a gradient descent algorithm
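The loss in step (5) can be sketched with encoded vectors standing in for the f_θ outputs, using the dot-product score from step (4) (function and variable names illustrative):

```python
import numpy as np

def contrastive_loss(z, z_pos, z_neg):
    """InfoNCE-style loss: an N-way softmax cross-entropy in which the
    positive is the correct class.  z: (M, D), z_pos: (M, D),
    z_neg: (M, N-1, D)."""
    s_pos = (z * z_pos).sum(axis=1, keepdims=True)    # f(x)^T f(x+), (M, 1)
    s_neg = np.einsum('md,mnd->mn', z, z_neg)         # f(x)^T f(x_j^-), (M, N-1)
    scores = np.concatenate([s_pos, s_neg], axis=1)   # (M, N)
    m = scores.max(axis=1, keepdims=True)             # for numerical stability
    log_z = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    return float(-(s_pos[:, 0] - log_z).mean())       # -mean log-softmax of positive

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
# Using the anchor itself as the positive and random negatives: the loss
# sits well below the chance level log(N) because the positive score dominates.
loss = contrastive_loss(z, z, rng.normal(size=(4, 5, 8)))
```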
34. Contrastive Learning
Architecture: x, x⁺, and the negatives x_j⁻ are each passed through a shared CNN encoder f_θ(·), whose outputs feed the loss.
Algorithm:
(0) Input: batch size M, network structure
(1) Randomly sample {x_l}_{l=1}^{M} from X := {x_t}_{t=1}^{T}
(2) Randomly initialize all parameters
(3) For each data point x_l, randomly draw one positive sample x_l⁺ from x_l and N−1 negative samples {x_j⁻(l)}_{j=1}^{N−1} from different classes
(4) Compute the encoder outputs f_θ(x_l), f_θ(x_l⁺), {f_θ(x_j⁻(l))}_{j=1}^{N−1} for l = 1, 2, …, M
(5) Using the empirical cross-entropy loss
L_M(θ) = −(1/M) Σ_{l=1}^{M} log[ exp(f_θ(x_l)ᵀ f_θ(x_l⁺)) / ( exp(f_θ(x_l)ᵀ f_θ(x_l⁺)) + Σ_{j=1}^{N−1} exp(f_θ(x_l)ᵀ f_θ(x_j⁻(l))) ) ]
update all parameters by a gradient descent algorithm: θ = argmin_θ L_M(θ)
(6) Repeat steps (1)~(5) until all updated parameters converge or change little within a given tolerance error ⇒ many, many epochs
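An end-to-end toy run of steps (0)-(6), with a linear encoder standing in for the CNN and numerical gradients standing in for autodiff (everything here is an illustrative stand-in, not the slide's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(W, x):                       # linear stand-in for the CNN f_theta
    return x @ W.T

def batch_loss(W, x, x_pos, x_neg):
    """Empirical loss L_M(theta) from step (5)."""
    z, zp = encode(W, x), encode(W, x_pos)
    zn = np.einsum('hd,mnd->mnh', W, x_neg)           # encode the negatives
    s_pos = (z * zp).sum(axis=1, keepdims=True)
    s_neg = np.einsum('mh,mnh->mn', z, zn)
    scores = np.concatenate([s_pos, s_neg], axis=1)
    m = scores.max(axis=1, keepdims=True)
    log_z = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    return float(-(s_pos[:, 0] - log_z).mean())

D, H, M, N = 4, 3, 8, 4
W = rng.normal(0.0, 0.1, (H, D))                  # step (2): init parameters
x = rng.normal(size=(M, D))                       # step (1): sample a batch
x_pos = x + 0.01 * rng.normal(size=x.shape)       # step (3): positives = noisy copies
x_neg = rng.normal(size=(M, N - 1, D))            # step (3): negatives
loss_before = batch_loss(W, x, x_pos, x_neg)
eps, lr = 1e-5, 0.01
for _ in range(30):                               # steps (4)-(6): descend
    base = batch_loss(W, x, x_pos, x_neg)
    grad = np.zeros_like(W)
    for i in range(H):                            # forward-difference gradient
        for j in range(D):
            Wp = W.copy()
            Wp[i, j] += eps
            grad[i, j] = (batch_loss(Wp, x, x_pos, x_neg) - base) / eps
    W -= lr * grad
loss_after = batch_loss(W, x, x_pos, x_neg)       # the loss decreases
```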