Kelvin Xu - Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)
Created: July 6, 2017 / Updated: November 2, 2024 / Status: finished / 1 min read (~185 words)
- Two attention-based image caption generators under a common framework:
- A "soft" deterministic mechanism trainable by standard back-propagation methods
- A "hard" stochastic attention mechanism trainable by maximizing an approximate variational lower bound or equivalently by REINFORCE
- We use a convolutional neural network to extract a set of feature vectors, which we refer to as annotation vectors
- The extractor produces $L$ vectors, each of which is a $D$-dimensional representation corresponding to a part of the image
- To obtain a correspondence between the feature vectors and portions of the 2D image, we extract features from a lower convolutional layer
- We use a long short-term memory (LSTM) network that produces a caption by generating one word at each time step, conditioned on a context vector, the previous hidden state, and the previously generated words (see the sketch after this list)
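The following is a minimal PyTorch sketch of the soft attention path described above: a small MLP scores each of the $L$ annotation vectors against the previous LSTM hidden state, a softmax turns the scores into weights $\alpha$, and the context vector is the $\alpha$-weighted sum of annotations fed into the LSTM step. All module names and dimensions here are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Deterministic 'soft' attention: the context vector is the
    expected annotation vector under the attention weights alpha."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)      # projects each annotation a_i
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)  # projects h_{t-1}
        self.score = nn.Linear(attn_dim, 1)                 # e_{t,i} = f_att(a_i, h_{t-1})

    def forward(self, feats, h_prev):
        # feats: (batch, L, D) annotation vectors; h_prev: (batch, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(h_prev).unsqueeze(1)))  # (batch, L, 1)
        alpha = F.softmax(e.squeeze(-1), dim=1)         # weights over the L image locations
        z = (alpha.unsqueeze(-1) * feats).sum(dim=1)    # (batch, D) context vector z_t
        return z, alpha

class DecoderStep(nn.Module):
    """One LSTM decoding step: attend over the image, then condition the
    next-word distribution on [embedded previous word, context vector]."""
    def __init__(self, vocab_size, embed_dim, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attend = SoftAttention(feat_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)    # logits over the vocabulary

    def forward(self, y_prev, feats, h, c):
        z, alpha = self.attend(feats, h)
        h, c = self.lstm(torch.cat([self.embed(y_prev), z], dim=1), (h, c))
        return self.out(h), h, c, alpha
```

For example, with $14 \times 14 \times 512$ features from a lower convolutional layer, `feats` would be the flattened $L = 196$ annotation vectors of dimension $D = 512$. The "hard" variant replaces the weighted sum with a sampled location $s_t \sim \text{Multinomial}(\alpha)$ and sets the context to the single annotation $a_{s_t}$; since sampling is not differentiable, it is trained by maximizing an approximate variational lower bound, or equivalently with REINFORCE.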
- Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International Conference on Machine Learning. 2015.