One Model To Learn Them All

Notes

The main question they ask: Can we create a unified deep learning model to solve tasks across multiple domains?

Four main parts to the model:

  • Input encoder
  • I/O mixer
  • Output decoder
  • Modality nets (one for each input/output type: text, images, audio, categorical data)

The first three are built from convolutional blocks, mixture-of-experts layers, and attention blocks.
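To make the composition concrete, here is a toy PyTorch sketch of how those parts might fit together. It is my own illustration, not the paper's code: the class names, layer sizes, single text modality, and the stripped-down conv / mixture-of-experts / attention blocks are all simplifying assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
D_MODEL, VOCAB, N_HEADS = 128, 1000, 4


class ConvBlock(nn.Module):
    """Simplified stand-in for the paper's convolutional blocks."""
    def __init__(self, d):
        super().__init__()
        self.depthwise = nn.Conv1d(d, d, kernel_size=3, padding=1, groups=d)
        self.pointwise = nn.Conv1d(d, d, kernel_size=1)
        self.norm = nn.LayerNorm(d)

    def forward(self, x):                       # x: (batch, time, d)
        y = self.pointwise(self.depthwise(x.transpose(1, 2))).transpose(1, 2)
        return self.norm(x + torch.relu(y))     # residual connection


class MoE(nn.Module):
    """Tiny mixture-of-experts layer: a gate softly combines expert outputs."""
    def __init__(self, d, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])
        self.gate = nn.Linear(d, n_experts)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)            # (b, t, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (b, t, d, E)
        return (outs * weights.unsqueeze(2)).sum(-1)


class MultiModelSketch(nn.Module):
    """Toy composition of the four parts listed above (text modality only)."""
    def __init__(self):
        super().__init__()
        self.text_in = nn.Embedding(VOCAB, D_MODEL)    # input modality net
        self.encoder = nn.Sequential(ConvBlock(D_MODEL), MoE(D_MODEL))
        self.mixer = nn.MultiheadAttention(D_MODEL, N_HEADS, batch_first=True)
        self.decoder = nn.Sequential(ConvBlock(D_MODEL), nn.Linear(D_MODEL, D_MODEL))
        self.text_out = nn.Linear(D_MODEL, VOCAB)      # output modality net

    def forward(self, src_tokens, tgt_tokens):
        enc = self.encoder(self.text_in(src_tokens))
        tgt = self.text_in(tgt_tokens)
        # I/O mixer: the output side attends over the encoded inputs.
        mixed, _ = self.mixer(tgt, enc, enc)
        return self.text_out(self.decoder(mixed))


model = MultiModelSketch()
logits = model(torch.randint(0, VOCAB, (2, 7)), torch.randint(0, VOCAB, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1000])
```

The point of the shape of this sketch is that only the modality nets are type-specific; the encoder, I/O mixer, and decoder operate on a shared representation, which is what lets one body of weights serve every task.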

They trained the model on eight different tasks: ImageNet classification, COCO image captioning, WSJ speech recognition, WSJ text parsing, and two pairs of language translation tasks (English↔German and English↔French). Overall performance was below state of the art on any given task, but that was before any major attempt at hyperparameter optimization. Training on all eight tasks jointly improved performance over training on a single task in most cases.
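A rough sketch of what joint training on eight tasks could look like with the toy model above. The round-robin batch schedule, the dummy data generators, and the deliberately trivial loss are my assumptions for illustration, not the paper's actual training setup.

```python
import itertools
import torch
import torch.nn as nn

# Stand-in batch generators, one per task; real loaders would yield
# (input, target) pairs for ImageNet, COCO captioning, WSJ speech, etc.
def dummy_batches(vocab=1000, src_len=7, tgt_len=5, batch=2):
    while True:
        yield (torch.randint(0, vocab, (batch, src_len)),
               torch.randint(0, vocab, (batch, tgt_len)))

task_iters = {name: dummy_batches() for name in
              ["imagenet", "coco_captions", "wsj_speech", "wsj_parsing",
               "en_de", "de_en", "en_fr", "fr_en"]}

model = MultiModelSketch()               # class from the sketch above
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Round-robin over tasks: each step draws one batch from the next task, so a
# single shared set of weights sees all eight tasks during training.
# (The objective here just scores the target tokens directly, with no
# shifting or teacher forcing; it is only meant to show the scheduling.)
for step, name in zip(range(16), itertools.cycle(task_iters)):
    src, tgt = next(task_iters[name])
    logits = model(src, tgt)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```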

Lots of links in this article to implementation details for the model parts.