Home ML Papers Lukasz Kaiser - One Model To Learn Them All (2017)

Lukasz Kaiser - One Model To Learn Them All (2017)

History / Edit / PDF / EPUB / BIB /
Created: June 22, 2017 / Updated: August 30, 2025 / Status: finished / 3 min read (~526 words)
machine-learning

A network composed of convolution blocks, attention blocks and of mixture-of-experts blocks which can process categorical data, text, audio and images
The network is able to improve its accuracy on existing tasks by learning tasks from another domain (image classification vs text translation)

Can we create a unified deep learning model to solve tasks across multiple domains?
Two key insights are crucial to making it (the MultiModel) work at all and are the main contributions of this work
- Small modality-specific sub-networks convert into a unified representation and back from it
  - The unified representation is variable-size
  - Different tasks from the same domain share modality nets
- Computational blocks of different kinds are crucial for good results on various problems

The MultiModel consists of a few small modality-nets, an encoder, I/O mixer, and an autoregressive decoder
The encoder and decoder are constructed using 3 computational blocks to get good performance across different problems:
- Convolutions allow the model to detect local patterns and generalize across space
- Attention layers allow to focus on specific elements to improve performance of the model
- Sparsely-gated mixture-of-experts gives the model capacity without excessive computational cost

A block of convolutions gets as input as tensor of shape [batch size, sequence length, feature channels] and returns a tensor of the same shape, processed as follows
For convolution operations, we use depthwise separable convolutions
- They are defined by a convolution on each feature channel separately, followed by a pointwise convolution to project to the desired feature depth
Noted $SepConv_{d,s,f}(W, x)
- d: dilation factor
- s: stride
- f: number of kernels of size $h \times w$
- W: weights
- x: input tensor
We use convolutions in blocks that consist of three components:
- A ReLU activation of the inputs
- A $SepConv$
- A layer normalization (LN)
A complete convolution step is defined as
$$ ConvStep_{d,s,f}(W, x) = LN(SepConv_{d,s,f}(W, ReLU(x)))$$
The convolutional steps are composed into blocks by stacking them and adding residual connections
We use stacks of four convolutional blocks with two skip-connections between the stack input and the outputs of the second and fourth convolutional steps

We use a multi-head dot-product attention mechanism
The inputs to the attention layer are two tensors: a source tensor and a target tensor both with the shape [batch size, sequence length, feature channels]
The target tensor is additively composed with a timing signal and mixed using two convolutional blocks

We use sparsely-gated mixture-of-experts layers: A mixture-of-expert layer consists of a number of simple feed-forward neural networks (experts) and a trainable gating network which selects a sparse combination of the experts to process each input

To allow the decoder to produce outputs for different tasks even with the same modality, we always start decoding with a command-token, such as To-English or To-Parse-Tree
We learn an embedding vector corresponding to each of the tokens during training

Kaiser, Lukasz, et al. "One Model To Learn Them All." arXiv preprint arXiv:1706.05137 (2017).