Handwritten Digit Recognition and performance of various modelsation[autosaved]SubhradeepMaji
This presentation covers handwritten digit recognition across different writers using a Convolutional Neural Network, and compares the performance of models built from different sequences of layers.
A kernel distribution is a non-parametric distribution of a random variable, which can be used when a parametric distribution cannot properly model the data.
This is a presentation on Handwritten Digit Recognition using Convolutional Neural Networks. Convolutional Neural Networks give better results as compared to conventional Artificial Neural Networks.
Offline Character Recognition Using Monte Carlo Method and Neural Networkijaia
Human-machine interfaces are constantly improving because of the increasing development of
computer tools. Handwritten character recognition has various significant applications such as form
scanning, verification, validation, and check reading. Because of the importance of these applications,
active research in the field of off-line handwritten character recognition is ongoing. The challenge in
recognising handwriting lies in the nature of humans, who have unique styles in terms of font, contours,
etc. This paper presents a novel approach to identifying offline characters, which we call the character divider
approach and which can be used after the pre-processing stage. We devise an innovative approach for feature
extraction known as vector contour. We also discuss the pros and cons, including limitations, of our
approach.
Deep learning for image super resolutionPrudhvi Raj
Using deep convolutional networks, the machine can learn an end-to-end mapping between low- and high-resolution images. Unlike traditional methods, this method jointly optimizes all the layers of the network. A lightweight CNN structure is used, which is simple to implement and offers a formidable trade-off compared with existing methods.
Fixed-Point Code Synthesis for Neural Networksgerogepatton
Over the last few years, neural networks have started penetrating safety-critical systems to make decisions in robots, rockets, autonomous cars, etc. A problem is that these critical systems often have limited computing resources. Often, they use fixed-point arithmetic for its many advantages (speed, compatibility with small memory devices). In this article, a new technique is introduced to tune the formats (precision) of already trained neural networks using fixed-point arithmetic, which can be implemented using integer operations only. The new optimized neural network computes the output with fixed-point numbers without modifying the accuracy up to a threshold fixed by the user. Fixed-point code is synthesized for the new optimized neural network, ensuring that the threshold is respected for any input vector belonging to the range [xmin, xmax] determined during the analysis. From a technical point of view, we do a preliminary analysis of our floating-point neural network to determine the worst cases, then we generate a system of linear constraints among integer variables that we can solve by linear programming. The solution of this system is the new fixed-point format of each neuron. The experimental results obtained show the efficiency of our method, which can ensure that the new fixed-point neural network has the same behavior as the initial floating-point neural network.
Secret-Fragment-Visible Mosaic Image-Creation and Recovery via Colour Transfo...IJSRD
A secret-fragment-visible mosaic image is created by automatically transforming a secret image into a meaningful mosaic image of the same size. The mosaic image looks like an arbitrarily selected target image. It may be used as camouflage for the secret image and is produced by dividing the secret image into fragments and transforming their color characteristics to match the corresponding blocks of the target image. Techniques are designed to conduct the color transformation process so that the secret image can be recovered. The information required for recovering the secret image is embedded into the created mosaic image. Good experimental results show the feasibility of the proposed method.
Large Convolutional Network models have
recently demonstrated impressive classification
performance on the ImageNet benchmark
(Krizhevsky et al., 2012). However
there is no clear understanding of why they
perform so well, or how they might be improved.
In this paper we address both issues.
We introduce a novel visualization technique
that gives insight into the function of intermediate
feature layers and the operation of
the classifier. Used in a diagnostic role, these
visualizations allow us to find model architectures
that outperform Krizhevsky et al. on
the ImageNet classification benchmark. We
also perform an ablation study to discover
the performance contribution from different
model layers. We show our ImageNet model
generalizes well to other datasets: when the
softmax classifier is retrained, it convincingly
beats the current state-of-the-art results on
Caltech-101 and Caltech-256 datasets
Summary:
There are three parts in this presentation.
A. Why do we need Convolutional Neural Network
- Problems we face today
- Solutions for problems
B. LeNet Overview
- The origin of LeNet
- The result after using LeNet model
C. LeNet Techniques
- LeNet structure
- Function of every layer
The GitHub link below points to a repository in which I rebuilt LeNet without any deep learning package. I hope it helps you better understand the basics of Convolutional Neural Networks.
Github Link : https://github.com/HiCraigChen/LeNet
LinkedIn : https://www.linkedin.com/in/YungKueiChen
Deep learning lecture - part 1 (basics, CNN)SungminYou
This presentation is a lecture based on the Deep Learning book (Bengio, Yoshua, Ian Goodfellow, and Aaron Courville. MIT press, 2017). It covers the basics of deep learning and the theory of convolutional neural networks.
Shallow vs. Deep Image Representations: A Comparative Study with Enhancements...CSCJournals
The traditional approach for solving the object recognition problem requires image representations to be first extracted and then fed to a learning model such as an SVM. These representations are handcrafted and heavily engineered by running the object image through a sequence of pipeline steps, which requires good prior knowledge of the problem domain. Moreover, since the classification is done in a separate step, the resultant handcrafted representations are not tuned by the learning model, which prevents it from learning complex representations that might give it more discriminative power. However, in end-to-end deep learning models, image representations along with the classification decision boundary are all learnt directly from the raw data, requiring no prior knowledge of the problem domain. These models learn the object image representation hierarchically in multiple layers corresponding to multiple levels of abstraction, resulting in representations that are more discriminative and give better results on challenging benchmarks. In contrast to the traditional handcrafted representations, the performance of deep representations improves with the introduction of more data and more learning layers (more depth), and they perform well on large-scale machine learning problems.
The purpose of this study is sixfold: (1) review the literature of the pipeline processes used in the previous state-of-the-art codebook model approach for tackling the problem of generic object recognition, (2) introduce several enhancements in the local feature extraction and normalization steps of the recognition pipeline, (3) compare the enhancements proposed to different encoding methods and contrast them to previous results, (4) experiment with current state-of-the-art deep model architectures used for object recognition, (5) compare deep representations extracted from the deep learning model with shallow representations handcrafted through the recognition pipeline, and finally, (6) improve the results further by combining multiple different deep learning models into an ensemble and taking the maximum posterior probability.
Image Captioning Generator using Deep Machine Learningijtsrd
The scope of technology has evolved into one of the most powerful tools for human development in a variety of fields. AI and machine learning have become some of the most powerful tools for completing tasks quickly and accurately without the need for human intervention. This project demonstrates how deep machine learning can be used to create a caption or a sentence for a given picture. This can be used for visually impaired persons, as well as automobiles for self identification, and for various applications to verify quickly and easily. The Convolutional Neural Network (CNN) is used to describe the alphabet, and the Long Short-Term Memory (LSTM) network is used to organize the right meaningful sentences in this model. The Flickr8k and Flickr30k datasets were used to train this. Sreejith S P | Vijayakumar A "Image Captioning Generator using Deep Machine Learning" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-4, June 2021, URL: https://www.ijtsrd.com/papers/ijtsrd42344.pdf Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/42344/image-captioning-generator-using-deep-machine-learning/sreejith-s-p
Scene recognition using Convolutional Neural NetworkDhirajGidde
Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success.
Our paper on homogeneous motion discovery oriented reference frame for high efficiency video coding talks about the idea of segmenting the current frame into cohesive motion regions made of blocks and then using these regions to form a motion compensated prediction. This prediction when used as an additional reference frame for the current frame, shows encouraging savings in bit rate over standalone HEVC reference coder.
Image De-Noising Using Deep Neural Networkaciijournal
A deep neural network, as part of a deep learning algorithm, is a state-of-the-art approach to finding higher-level representations of input data, and has been applied successfully to many practical and challenging learning problems. The primary goal of deep learning is to use large amounts of data to help solve a given machine learning task. We propose a methodology for image de-noising based on this model and train on a large image database to obtain the experimental output. The results show the robustness and efficiency of our algorithm.
APPLICATION OF IMAGE FUSION FOR ENHANCING THE QUALITY OF AN IMAGEcscpconf
Advances in technology have brought about extensive research in the field of image fusion.
Image fusion is one of the most researched challenges of Face Recognition. Face Recognition
(FR) is the process by which the brain and mind understand, interpret and identify or verify
human faces. Image fusion is the combination of two or more source images which vary in
resolution, instrument modality, or image capture technique into a single composite
representation. Thus, the source images are complementary in many ways, with no one input
image being an adequate data representation of the scene. Therefore, the goal of an image
fusion algorithm is to integrate the redundant and complementary information obtained from
the source images in order to form a new image which provides a better description of the scene
for human or machine perception. In this paper we have proposed a novel approach of pixel
level image fusion using PCA that will remove the image blurredness in two images and
reconstruct a new de-blurred fused image. The proposed approach is based on the calculation
of Eigen faces with Principal Component Analysis (PCA). Principal Component Analysis (PCA)
has been the most widely used method for dimensionality reduction and feature extraction.
Targeted Visual Content Recognition Using Multi-Layer Perceptron Neural Networkijceronline
Visual Content Recognition has become an attractive research-oriented field of computer vision and machine learning over the last few decades. The focus of this work is monument recognition. Images of significant locations, captured and maintained as databases, can be used by travelers before visiting the places. They can use images of a famous building to find the description of the building. In all these applications, visual content recognition plays a key role. Humans can learn the contents of images and quickly identify them on seeing them again. In this paper we present a constructive training algorithm for a Multi-Layer Perceptron Neural Network (MLPNN) applied to a set of targeted object recognition applications. The target set consists of famous monuments in India for travel guide applications. The training data set (TDS) consists of 3000 images. The Gist features are extracted for the images. These are given to the neural network during the training phase. The mean square error (MSE) on the training data is computed and used as the metric to adjust the weights of the neural network, using the back propagation algorithm. In the constructive learning, if the MSE is less than a predefined value, the number of hidden neurons is increased. Input patterns are trained incrementally until all patterns of the TDS are presented and learned. The parameters or weights obtained during the training phase are used in the testing phase, in which new untrained images are given to the neural network for recognition. If the test image is recognized, the details of the image will also be displayed. The performance accuracy of this method is found to be 95%.
Image fusion is a subfield of image processing in which more than one image is fused to create an image where all the objects are in focus. The process of image fusion is performed for multi-sensor and multi-focus images of the same scene. Multi-sensor images of the same scene are captured by different sensors, whereas multi-focus images are captured by the same sensor. In multi-focus images, the objects in the scene which are closer to the camera are in focus and the farther objects are blurred. Conversely, when the farther objects are in focus, the closer objects are blurred. To achieve an image where all the objects are in focus, image fusion is performed either in the spatial domain or in a transformed domain. In recent times, the applications of image processing have grown immensely. Usually, due to the limited depth of field of optical lenses, especially those with greater focal length, it becomes impossible to obtain an image where all the objects are in focus. Image fusion therefore plays an important role in supporting other image processing tasks such as image segmentation, edge detection, stereo matching and image enhancement. Hence, a novel feature-level multi-focus image fusion technique has been proposed. The results of extensive experimentation performed to highlight the efficiency and utility of the proposed technique are presented. The proposed work further compares fuzzy-based image fusion with a neuro-fuzzy fusion technique, along with quality evaluation indices.
Text and Object Recognition using Deep Learning for Visually Impaired Peopleijtsrd
The main aim of this paper is to aid visually impaired people with object detection and text detection using deep learning. Object detection is done using a convolutional neural network and text recognition is done by optical character recognition. The detected output is converted into speech using a text-to-speech synthesizer. Object detection comprises two methods. One is object localization and the other is image classification. Image classification refers to the prediction of classes of different objects within an image. Object localization infers the location of objects using bounding boxes. R. Soniya | B. Mounica | A. Joshpin Shyamala | Mr. D. Balakumaran "Text and Object Recognition using Deep Learning for Visually Impaired People" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-5, August 2020, URL: https://www.ijtsrd.com/papers/ijtsrd31508.pdf Paper URL: https://www.ijtsrd.com/engineering/electronics-and-communication-engineering/31508/text-and-object-recognition-using-deep-learning-for-visually-impaired-people/r-soniya
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
2. Abstract
1. Introduce the first dataset for Sequential vision-to-language
2. SIND v.1 (Sequential Images Narrative Dataset)
3. 81,743 unique photos in 20,211 sequences
4. Establish strong baselines
5. Move artificial intelligence from basic understandings of typical visual scenes
towards more and more human-like understanding of grounded event structure
and subjective expression
3. Table of contents
1. Introduction
2. Motivation and Related Work
3. Dataset Construction
4. Data Analysis
5. Automatic Evaluation Metric
6. Baseline Experiments
7. Conclusion
4. Introduction
1. From description (concrete, literal) to narrative (abstract, further inference)
2. “Sitting next to each other” vs. “Having a good time”
3. Release three tiers of language for the same images
1) Descriptions of images-in-isolation (DII)
2) Descriptions of images-in-sequence (DIS)
3) Stories for images-in-sequence (SIS)
5. Motivation and Related Work
1. Image Captioning (2014,2015)
2. Question answering (2014)
3. Visual phrase (2011)
4. Vision Understanding (2013)
5. Visual Concepts (2015,2016)
Those works focus on direct, literal description of image content
6. Dataset Construction
1. Extracting Photos
1. Leverage the idea that “storyable” events tend to involve some form of
possession (John’s party; Shabnam’s visit)
2. Extract Flickr data with possessive dependency patterns (Stanford CoreNLP)
3. Use WordNet 3.0 to identify EVENT terms
4. Only include albums with 10 to 50 photos where all album photos are taken
within a 48-hour span
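The album filter in step 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `is_storyable_album`, the treatment of both bounds as inclusive, and the toy timestamps are hypothetical and are not taken from the slides.

```python
from datetime import datetime, timedelta

def is_storyable_album(timestamps, min_photos=10, max_photos=50,
                       max_span=timedelta(hours=48)):
    """Return True if the album passes the size and time-span filter.

    Assumption: both the photo-count bounds and the 48-hour limit are inclusive.
    """
    if not (min_photos <= len(timestamps) <= max_photos):
        return False
    # Span between the earliest and the latest photo in the album.
    return max(timestamps) - min(timestamps) <= max_span

# Hypothetical albums used only to exercise the filter.
start = datetime(2015, 7, 4, 9, 0)
weekend_trip = [start + timedelta(hours=3 * i) for i in range(12)]   # 33-hour span
long_holiday = [start + timedelta(hours=12 * i) for i in range(12)]  # 132-hour span
too_small = weekend_trip[:5]                                         # only 5 photos

kept = [is_storyable_album(a) for a in (weekend_trip, long_holiday, too_small)]
```

Only the first album passes: the second spans more than 48 hours and the third has fewer than 10 photos.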
7. Dataset Construction
2. Crowdsourcing Stories In Sequence
1. 2-stage crowdsourcing
2. Storytelling: the worker selects a subset of photos and writes a story about it
3. Re-telling: the worker writes a story based on one photo sequence generated
in the first stage
8. Dataset Construction
3. Crowdsourcing Descriptions of Images In Isolation & Images In
Sequence
1. Also collect descriptions of images-in-isolation and descriptions of images-in-
sequence
2. Follow the instructions for image captioning (MS COCO). Ex: describe all the
important parts
4. Data Post-processing
1. Replace name and identified named entities
10. Data analysis
1. Dataset includes 10,117 Flickr albums with 210,819 unique photos.
2. Use normalized pointwise mutual information to identify the words most closely
associated with each tier.
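For readers unfamiliar with normalized pointwise mutual information (NPMI), a minimal sketch follows. It uses the standard definition npmi(w, t) = pmi(w, t) / -log p(w, t), which ranges from -1 to 1. The function name `npmi_by_tier` and the toy tokens are hypothetical, and the slides do not say how the counts were actually collected.

```python
import math
from collections import Counter

def npmi_by_tier(tier_tokens):
    """Map each (word, tier) pair to its normalized PMI.

    tier_tokens: dict of tier name -> list of tokens seen in that tier.
    """
    joint = Counter()
    for tier, tokens in tier_tokens.items():
        for tok in tokens:
            joint[(tok, tier)] += 1
    total = sum(joint.values())

    word_counts, tier_counts = Counter(), Counter()
    for (word, tier), count in joint.items():
        word_counts[word] += count
        tier_counts[tier] += count

    scores = {}
    for (word, tier), count in joint.items():
        p_joint = count / total
        p_word = word_counts[word] / total
        p_tier = tier_counts[tier] / total
        pmi = math.log(p_joint / (p_word * p_tier))
        denom = -math.log(p_joint)
        # Guard the degenerate case p_joint == 1, where both terms are zero.
        scores[(word, tier)] = pmi / denom if denom > 0 else 0.0
    return scores

# Toy corpus (hypothetical words) for two of the tiers named in the slides.
scores = npmi_by_tier({
    "DII": ["sitting", "table", "sitting", "man"],
    "SIS": ["great", "time", "great", "man"],
})
```

In this toy corpus "sitting" appears only in DII and scores 0.5 for that tier, while "man" appears equally in both tiers and scores 0.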
11. Automatic Evaluation Metric
1. Human judgment is the most reliable way to evaluate.
2. Compute pairwise correlation coefficients between automatic metrics and
human judgments (score from 1-5) on 3000 stories from SIS training set.
3. Automatic metrics: METEOR, smoothed-BLEU and Skip-Thoughts
4. METEOR correlates best with human judgment.
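The comparison between automatic metrics and human ratings can be illustrated with a small correlation computation. The slides do not say which correlation coefficient was used, so Pearson's r is an assumption here, and the metric names and scores below are invented placeholders rather than results from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical human ratings (1-5) for five stories, and two made-up metrics.
human = [1, 2, 3, 4, 5]
metric_scores = {
    "metric_a": [0.10, 0.20, 0.30, 0.40, 0.50],  # rises with the human rating
    "metric_b": [0.30, 0.10, 0.40, 0.20, 0.50],  # only loosely related
}

correlations = {name: pearson(vals, human) for name, vals in metric_scores.items()}
best_metric = max(correlations, key=correlations.get)
```

The metric with the highest correlation is the one that tracks human judgment most closely on this sample.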
12. Baseline Experiments
1. Use Sequence-to-Sequence recurrent neural net
2. Encode an image sequence by running an RNN over fc7 vectors of each image, in
reverse order.
3. Use Gated Recurrent Units (GRUs) for both the image encoder and story decoder
4. Initially, beam search (size = 100), but it produces many repetitive sentences.
5. Greedy search significantly increases the story quality.
6. Same content word cannot be produced more than once within a given story.
7. Filter out some “visually grounded” words
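The greedy search with the no-repeat rule for content words (items 5 and 6) can be sketched as follows. This is a toy illustration, not the authors' decoder: `toy_scores` stands in for the GRU decoder's next-word distribution, and the stop-word list, end token and vocabulary are hypothetical.

```python
STOPWORDS = {"the", "a", "and", "we", "was", "."}  # hypothetical function words
END = "</s>"                                       # hypothetical end-of-story token

def greedy_decode(step_scores, max_len=10):
    """Pick the best-scoring word at each step, never reusing a content word."""
    story, used_content = [], set()
    for _ in range(max_len):
        scores = step_scores(story)
        # Drop content words that have already appeared in this story.
        allowed = {tok: s for tok, s in scores.items() if tok not in used_content}
        if not allowed:
            break
        token = max(allowed, key=allowed.get)
        if token == END:
            break
        story.append(token)
        if token not in STOPWORDS:
            used_content.add(token)
    return story

# Stand-in for the decoder: scores depend only on the previous word.
TABLE = {
    None: {"the": 0.6, "beach": 0.4},
    "the": {"beach": 0.5, "sun": 0.3, END: 0.2},
    "beach": {"the": 0.7, END: 0.3},
    "sun": {END: 0.9, "the": 0.1},
}

def toy_scores(prefix):
    return TABLE[prefix[-1] if prefix else None]

story = greedy_decode(toy_scores)
```

Without the constraint this toy decoder would loop on "the beach the beach"; with it, the second "beach" is blocked and the decoder moves on to "sun" before ending the story.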
14. The details of the training were:
1. Extract 4096-dim FC 7 features using VGG16 without fine tuning
2. The encoder reads over the 5 images in a sequence. The order of the images is reversed (i.e., the
first image in the sequence is the last one read in, following what is commonly done for
machine translation. This is probably not important though).
3. The encoder and decoder are 1000 dimensional GRU (no weight sharing)
4. The target word embedding size is 250 dimension (i.e., the dimension when the word that was
just produced is fed into the decoder GRU).
5. The target vocabulary consists of words that occur 3 or more times in the training data. Other words are
mapped to UNK (there is a constraint in the decoder that UNK cannot be produced at test time,
however).
6. 0.5 dropout on the image FC7 input (i.e., 50% of the 4096-dim FC7 features are dropped out
before being fed into encoder GRU. This is probably not important).
7. 0.5 dropout on the decoder GRU layer before applying it to the output layer.
8. If the story model is co-trained with caption data, you should use a token in the encoder GRU to
indicate which type of output to produce.
9. It's analogous to machine translation sequence-to-sequence models.
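Two of the preprocessing details above, the frequency-thresholded vocabulary (item 5) and the reversed image order (item 2), can be sketched in plain Python. The helper names and the toy stories are hypothetical. The sketch covers only data preparation, not the GRU encoder and decoder themselves.

```python
from collections import Counter

UNK = "UNK"

def build_vocab(stories, min_count=3):
    """Keep words that occur at least min_count times in the training stories."""
    counts = Counter(tok for story in stories for tok in story)
    return {tok for tok, count in counts.items() if count >= min_count}

def map_to_unk(story, vocab):
    """Replace out-of-vocabulary words with the UNK token."""
    return [tok if tok in vocab else UNK for tok in story]

def encoder_input_order(image_features):
    """Reverse the sequence so the first image is the last one the encoder reads."""
    return list(reversed(image_features))

# Toy training stories (hypothetical).
stories = [
    ["we", "went", "to", "the", "beach"],
    ["we", "went", "to", "the", "park"],
    ["we", "went", "to", "the", "zoo"],
]
vocab = build_vocab(stories)
mapped = map_to_unk(stories[0], vocab)
order = encoder_input_order(["img1", "img2", "img3", "img4", "img5"])
```

Here "beach", "park" and "zoo" each occur once, so they fall below the threshold and map to UNK, while the five image placeholders are fed to the encoder from last to first.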