So Kembhavi and his colleagues Jaemin Cho, Jiasen Lu, and Hannaneh Hajishirzi decided to see if they could teach a model all this implicit visual knowledge by tweaking their approach to masking. Rather than train the model just to predict masked words in the captions from the corresponding photos, they also trained it to predict masked pixels in the photos on the basis of their corresponding captions.

The final images generated by the model aren’t exactly realistic. But that isn’t the point. They contain the right high-level visual concepts—the AI equivalent of a child drawing a stick figure to represent a human. (You can try out the model for yourself here.)