Audio is easier to fake. Computers normally generate speech by stitching together short recorded fragments to form a sentence; that is how the voice of Siri, Apple’s digital assistant, is produced. But digital voices like this are limited by the range of fragments they have memorised, and sound truly realistic only when speaking from a specific repertoire of phrases.
Generative audio works differently. Neural networks learn the statistical properties of the audio source in question, then reproduce those properties directly in any context, modelling how speech changes not just second by second but millisecond by millisecond. Putting words into the mouth of Mr Trump, say, or of any other public figure, is a matter of feeding recordings of his speeches into the algorithmic hopper and then telling the trained software what you want that person to say. Alphabet’s DeepMind in Britain, Baidu’s Institute of Deep Learning in Silicon Valley and the Montreal Institute for Learning Algorithms (MILA) have all published highly realistic text-to-speech algorithms along these lines in the past year. For now these algorithms require computing power available only to large technology companies, but that will change.
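The core idea, stripped of the deep neural networks, can be shown in a toy sketch: fit a model that predicts each audio sample from the ones before it, then generate new sound by rolling that model forward one sample at a time. This is a deliberately simplified illustration (a two-coefficient autoregressive model fitted to a pure tone, not any of the published algorithms), with all names and numbers invented for the example:

```python
import math

def fit_ar2(x):
    """Least-squares fit of x[t] ~ a*x[t-1] + b*x[t-2] via normal equations.

    This "learns the statistical properties" of the waveform in the
    crudest possible way: two coefficients instead of a neural network.
    """
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t in range(2, len(x)):
        p1, p2, y = x[t - 1], x[t - 2], x[t]
        s11 += p1 * p1
        s12 += p1 * p2
        s22 += p2 * p2
        r1 += p1 * y
        r2 += p2 * y
    det = s11 * s22 - s12 * s12
    a = (r1 * s22 - r2 * s12) / det
    b = (s11 * r2 - s12 * r1) / det
    return a, b

def generate(a, b, seed, n):
    """Autoregressively extend the waveform, one sample at a time."""
    out = list(seed)
    for _ in range(n):
        out.append(a * out[-1] + b * out[-2])
    return out

# "Training data": a pure tone sampled 100 times per cycle.
wave = [math.sin(2 * math.pi * t / 100) for t in range(400)]

a, b = fit_ar2(wave)              # learn the source's statistics
cont = generate(a, b, wave[:2], 400)  # synthesise new samples from them
```

Real systems replace the two learned coefficients with millions of neural-network parameters conditioned on the text to be spoken, but the generation loop is the same in spirit: each new sample is predicted from the samples already produced.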