Anyone who has studied Italian learns early to pay attention to context when describing a broom, because the Italian word for this mundane domestic item has an extremely NSFW second meaning as a verb*. Though we learn early to disentangle the semantic mapping and apposite applicability of words with multiple meanings, this is not a skill that is easy to pass on to hyperscale image synthesis systems such as DALL-E 2 and Stable Diffusion, because they rely on OpenAI's Contrastive Language–Image Pre-training (CLIP) module, which treats objects and their properties rather more loosely (yet which is gaining ever more ground in the latent diffusion image and video synthesis space).
Examining this shortfall, a new research collaboration from Bar-Ilan University and the Allen Institute for Artificial Intelligence offers an extensive study of how far DALL-E 2 is disposed towards such semantic errors.
The authors have found that this tendency to double-interpret words and phrases not only seems to be common to all CLIP-guided diffusion models, but also gets worse as the models are trained on ever larger amounts of data. The paper notes that ‘reduced' versions of text-to-image models, including DALL-E Mini (now Craiyon), output these kinds of errors far less frequently, and that Stable Diffusion also errs less, though only because, very often, it does not follow the prompt at all, which is another kind of error.
Explaining how we perform efficient lexical separations, the paper states:
‘While symbols – as well as sentence structures – may be ambiguous, after an interpretation is constructed this ambiguity is already resolved. For example, while the symbol bat in a flying bat can be interpreted as either a wooden stick or an animal, our possible interpretations of the sentence are either of a flying wooden stick or a flying animal, but never both at the same time. Once the word bat has been used in the interpretation to denote an object (for example a wooden stick), it cannot be re-used to denote another object (an animal) in the same interpretation.'
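DALL-E 2, the paper observes, is not constrained in this way.
The constraint itself, under which a symbol that has been used for one meaning cannot be re-used for another within the same interpretation, has been named resource sensitivity.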
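The idea in the quoted passage can be sketched in a few lines of Python. This is only an illustration of the concept, not anything taken from the paper: the `SENSES` table and the `interpretations` helper are hypothetical names invented here. Each word occurrence is treated as a resource that is spent on exactly one sense, so no single reading of ‘a flying bat' can contain both the stick and the animal.

```python
from itertools import product

# Hypothetical sense inventory, made up for illustration (not from the paper).
SENSES = {
    "bat": ["wooden stick", "animal"],
}

def interpretations(words):
    """Enumerate readings in which each word occurrence receives exactly one sense."""
    # Words without an entry in SENSES are treated as unambiguous.
    options = [SENSES.get(word, [word]) for word in words]
    return [tuple(choice) for choice in product(*options)]

readings = interpretations(["flying", "bat"])
for reading in readings:
    print(reading)
```

Every reading produced this way has one slot per word, which is precisely the restriction that the duplicated objects in DALL-E 2's output appear to ignore.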
The paper identifies three aberrant behaviors exhibited by DALL-E 2: that a word or a phrase can get interpreted and effectively bifurcated into two distinct entities, rendering an object or concept for each in the same scene; that a word can be interpreted as a modifier of two different entities (see the ‘goldfish' and other examples above); and that a word can be interpreted simultaneously as both a modifier and an alternate entity – exemplified by the prompt ‘a seal is opening a letter':
The authors identify two failure modes for diffusion models in this respect: that the results of user prompts with sense-ambiguous words will often exhibit the concretized word together with some manifestation of the concept; and concept leakage, where the properties of one object ‘leak' into another rendered object.
‘Taken together, the phenomena we examine provides evidence for limitations in the linguistic ability of DALLE-2 and opens avenues for future research that would uncover whether those stem from issues with the text encoding, the generative model, or both. More generally, the proposed approach can be extended to other scenarios where the decoding process is used to uncover the inductive bias and the shortcomings of text-to-image models.'
Using 17 words that will cause DALL-E 2 to split the input into multiple outputs, the authors observed that homonym duplication occurred in over 80% of 216 images rendered.
The researchers used stimuli-control pairs to examine the extent to which specific and arguably over-specified language is necessary to stop these duplications occurring. For the entity-to-property tests, 10 such pairs were created, and the authors note that the stimuli prompts provoke the shared property in 92.5% of cases, whereas the control prompts elicit it in only 6.6% of cases.
‘[To] demonstrate, consider a zebra and a street, here, zebra is an entity, but it modifies street, and DALLE-2 constantly generates crosswalks, possibly because of the zebra-stripes’ likeness to a crosswalk. And in line with our conjecture, the control a zebra and a gravel street specifies a type of street that typically does not have crosswalks, and indeed, all of our control samples for this prompt do not contain a crosswalk.'
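A stimuli-control comparison of this kind comes down to counting how often the shared property turns up in each set of generated images. The sketch below assumes that every image has already been hand-labelled as showing the property or not; the labels are invented for illustration and are not the paper's data, and `property_rate` is a hypothetical helper.

```python
def property_rate(annotations):
    """Percentage of images that annotators marked as showing the shared property."""
    if not annotations:
        raise ValueError("no annotations supplied")
    return 100.0 * sum(annotations) / len(annotations)

# Made-up labels: True means the property (for instance, a crosswalk) is visible.
labels = {
    "stimulus": [True, True, True, False],   # e.g. 'a zebra and a street'
    "control": [False, False, False, False], # e.g. 'a zebra and a gravel street'
}

rates = {name: property_rate(marks) for name, marks in labels.items()}
print(rates)
```

A wide gap between the two rates, such as the one the authors report, suggests that the property is leaking from the entity in the stimulus prompt rather than arising by chance.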
The researchers' experiments with DALL-E Mini could not replicate these findings, which they attribute to the lower capabilities of such models, and to the likelihood that their reductive processes light on the most ‘obvious' interpretation of a sense-ambiguous word more easily:
‘We hypothesize that – paradoxically – it is the lower capacity of DALLE-mini and Stable-diffusion and the fact they do not robustly follow the prompts, that make them appear “better” with respect to the flaws we examine. A thorough evaluation of the relation between scale, model architecture, and concept leakage is left to future work.'
Prior work from 2021, the authors note, had already observed that CLIP's embeddings don't explicitly bind a concept's attributes to the object itself. ‘Accordingly,' they write, ‘they observe that reconstructions from the decoder often mix up attributes and objects.'
* DALL-E 2 has some issues in this specific case. Inputting the prompt ‘Una donna che sta scopando' (‘a woman sweeping') summons up various middle-aged women sweeping courtyards, etc. However, if you add ‘in a bedroom' (in Italian), the prompt invokes DALL-E 2's NSFW filter, stating that the results violate OpenAI's content policy.
First published 20th October 2022.