The Crystal Ball Hypothesis in diffusion models: Anticipating object positions from initial noise

Yuanhao Ban¹, Ruochen Wang¹, Tianyi Zhou², Boqing Gong³, Cho-Jui Hsieh¹, Minhao Cheng⁴

¹University of California, Los Angeles, ²University of Maryland, College Park,
³Google ⁴The Pennsylvania State University

We discover the existence of trigger patches: distinct patches in the noise space that trigger the generation of objects in the diffusion model. As the figure shows, when a patch in another initial noise is replaced with a trigger patch, an object is generated at the injection position.

Abstract

In this study, we explore the role of initial noise in diffusion models. In particular, we identify specific regions, termed trigger patches, that can induce object generation in the resulting images. To identify these patches, we train a detector that anticipates them from the initial noise. To explain how these patches form, we use two-sample tests to show that they are outliers in Gaussian noise and follow distinct distributions. After examining the interaction between prompts and initial noise, we find that misalignment between prompts and trigger patch patterns can result in unsuccessful image generation. We propose a rejection-sampling strategy to obtain better initial noise, aiming to improve prompt adherence and positional diversity in image generation.

How to locate trigger patches: Post-generation method

We generate a set of images from the same initial noise and find that the dispersion of the object bounding boxes varies across noises. As shown in the figure, the objects in the top row are more dispersed than those in the bottom row. This suggests that the bottom-row noise contains a strong trigger patch inside the orange bounding box, which induces the ball in the same region. To quantify the degree of dispersion, we first take the center points $(x_{c_i}, y_{c_i})$ of the bounding boxes and define trigger entropy on them. A lower trigger entropy means the boxes are more concentrated, which indicates a stronger trigger patch.
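A minimal sketch of such a dispersion measure is given below. It assumes a Shannon entropy over a coarse grid histogram of the box centers; the function name `center_entropy`, the grid size, and the normalized coordinates are illustrative assumptions, and the paper's exact definition of trigger entropy may differ.

```python
import numpy as np

def center_entropy(centers, grid=4):
    """Shannon entropy (in nats) of box centers binned on a grid x grid lattice.

    centers: iterable of (x, y) pairs, normalized to [0, 1].
    Assumed stand-in for trigger entropy, not the paper's exact formula.
    """
    pts = np.asarray(centers, dtype=float)
    # Map each center to a cell index; clip so that a coordinate of 1.0
    # falls into the last cell instead of out of range.
    cells = np.clip((pts * grid).astype(int), 0, grid - 1)
    flat = cells[:, 1] * grid + cells[:, 0]
    counts = np.bincount(flat, minlength=grid * grid)
    p = counts[counts > 0] / counts.sum()
    return float(np.sum(p * np.log(1.0 / p)))

# Centers that cluster in one region versus centers spread over the image.
concentrated = [(0.80, 0.70), (0.82, 0.72), (0.81, 0.69), (0.79, 0.71)]
dispersed = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.9, 0.9)]
print(center_entropy(concentrated), center_entropy(dispersed))
```

Under this assumed definition, centers that all fall into one cell give zero entropy, while centers spread evenly over several cells give a higher value.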

How to locate trigger patches: Pre-generation method

To anticipate the trigger patch before generation, we train a trigger patch detector, which functions similarly to an object detector but operates in the noise space. We train it in three different settings; the results are shown in the following bar chart. When the model is also required to predict the class of the object generated in the box, the mAP drops substantially, indicating that trigger patches generalize across classes rather than being tied to a specific class.

What contributes to the formation of trigger patches

We hypothesize that these trigger patches are outliers in the sampled noise and have minimal correlation with prompts. We perform an energy-based two-sample test to measure how far a patch deviates from the original distribution. We divide 20,000 noises into ten groups ordered by trigger entropy. Additionally, we craft a control group by randomly sampling from a standard Gaussian distribution. We find that the p-values between every trigger patch group and the standard Gaussian are 0.0, while the p-value of the control group is 0.938. Furthermore, the more a patch deviates from the original distribution, the more likely it is to be a trigger patch, characterized by low trigger entropy.
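The idea of an energy-based two-sample test can be sketched as follows. This is a minimal illustration, assuming the standard energy statistic 2E|X-Y| - E|X-X'| - E|Y-Y'| with a permutation p-value; the function names, sample sizes, patch dimensionality, and the synthetic "shifted" group are assumptions for illustration and are not the setup used in the paper.

```python
import numpy as np

def energy_distance(x, y):
    """Energy statistic 2E|X-Y| - E|X-X'| - E|Y-Y'| for samples of shape (n, d), (m, d)."""
    def mean_dist(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(-1)).mean()
    return 2 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

def permutation_pvalue(x, y, n_perm=200, seed=0):
    """Fraction of random relabelings whose statistic is at least the observed one."""
    rng = np.random.default_rng(seed)
    observed = energy_distance(x, y)
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        px, py = pooled[perm[:len(x)]], pooled[perm[len(x):]]
        hits += int(energy_distance(px, py) >= observed)
    return hits / n_perm

rng = np.random.default_rng(1)
reference = rng.standard_normal((60, 16))       # flattened patches drawn from N(0, I)
control = rng.standard_normal((60, 16))         # fresh Gaussian draws (control group)
shifted = rng.standard_normal((60, 16)) + 0.5   # synthetic outlier patches (assumed)
print(permutation_pvalue(shifted, reference), permutation_pvalue(control, reference))
```

With this estimator, a p-value of exactly 0.0 means that no relabeling produced a statistic as large as the observed one, so its resolution is limited by the number of permutations.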

How do prompts and initial noises interact with each other

We ask what happens when the prompt also contains positional information, and how the prompt and the noise interact when the object positions they suggest agree or conflict. We therefore test two initial noises with trigger patches on the left and right, and two sets of prompts following the template "a {class name} on the left/right". Note that there are three prompt-and-noise pairs: 1) Aligned: with a strong trigger patch on the right. 2) Contradicted: with a strong trigger patch on the left. 3) Dispersed: with trigger patches dispersed throughout the image.

After viewing the generated images, we divide them into four groups: 1) Aligned: the position of the object agrees with the prompt guidance (on the right side). 2) Contradicted: the position of the object contradicts the prompt guidance (on the left side). 3) Duplicated: the generated image has an object on each side. 4) Hard to judge: the object occupies the entire picture, the generation failed, or the object is in the middle of the image. The results are as follows:

Application: Enhanced location diversity by removing trigger patches

Our detector can effectively mitigate positional bias, where objects tend to appear in the same location during generation. We employ a rejection sampling method that begins by detecting trigger patches in the initial noise. If the confidence score of a bounding box exceeds a predefined threshold, the region within the box is flagged for regeneration. This approach ensures that the noise used for generation is "pure" and free from strong trigger patches that could constrain object placement. The following figure illustrates cases with and without rejection sampling; the cases with rejection sampling show greater diversity in position.
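The rejection sampling loop can be sketched as below. The trained detector is not available here, so `toy_detector` is a hypothetical placeholder that scores fixed patches by the z-score of their mean; the latent shape (4, 64, 64), the patch size, the threshold, and all function names are assumptions for illustration only.

```python
import numpy as np

def toy_detector(noise, patch=16):
    """Hypothetical stand-in for the trained trigger patch detector.

    Returns a list of ((top, left, height, width), score) over a fixed grid.
    """
    _, height, width = noise.shape
    detections = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            region = noise[:, top:top + patch, left:left + patch]
            # z-score of the patch mean under N(0, 1): |mean| * sqrt(number of values)
            score = abs(region.mean()) * np.sqrt(region.size)
            detections.append(((top, left, patch, patch), float(score)))
    return detections

def reject_trigger_patches(noise, detector, threshold, rng, max_rounds=50):
    """Resample every region whose detection score exceeds the threshold."""
    noise = noise.copy()
    for _ in range(max_rounds):
        flagged = [box for box, score in detector(noise) if score > threshold]
        if not flagged:
            break
        for top, left, h, w in flagged:
            noise[:, top:top + h, left:left + w] = rng.standard_normal(
                (noise.shape[0], h, w))
    return noise

rng = np.random.default_rng(0)
threshold = 3.0
noise = rng.standard_normal((4, 64, 64))
noise[:, 16:32, 32:48] += 0.5   # plant a synthetic "trigger patch" (assumed)
cleaned = reject_trigger_patches(noise, toy_detector, threshold, rng)
print(max(s for _, s in toy_detector(noise)), max(s for _, s in toy_detector(cleaned)))
```

Only the flagged regions are redrawn, so the rest of the noise map is left as sampled; a real detector would return variable-sized boxes with confidence scores rather than a fixed grid.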

To perform a quantitative analysis, we define an entropy to assess the diversity. It is similar to the trigger entropy, with higher values indicating greater diversity. The results are as follows:

Application: Pre-generation position control

Our findings can contribute to controllable generation: place the trigger patch where you want the object to appear. To test the effectiveness of this method, we replace an arbitrary patch in a noise map with the trigger patch and then use the blended noise for generation. We sort the patches in our dataset by trigger entropy and select one out of every two thousand for this test. We define the injection success rate (ISR) as the ratio of successful injection cases to the total number of cases. We compare against two baselines: 1) Resampling: resample Gaussian noise for the target region while maintaining the same mean and variance. 2) Random: select a random patch within a source noise, which might overlap with the trigger patch in that source noise. The results are as follows:
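The patch replacement and the ISR computation can be sketched as follows. This is a minimal illustration under assumptions: the latent shape (4, 64, 64), the box format (top, left, height, width), and the function names are hypothetical, the success criterion for an injection is not specified here and is passed in as booleans, and taking the resampling statistics from the reference patch is our reading of the baseline.

```python
import numpy as np

def inject_patch(target_noise, source_noise, src_box, dst_box):
    """Return a copy of target_noise with dst_box overwritten by src_box of source_noise."""
    st, sl, h, w = src_box
    dt, dl, dh, dw = dst_box
    if (h, w) != (dh, dw):
        raise ValueError("source and destination boxes must have the same size")
    blended = target_noise.copy()
    blended[:, dt:dt + h, dl:dl + w] = source_noise[:, st:st + h, sl:sl + w]
    return blended

def resample_like(patch, rng):
    """Resampling baseline: fresh Gaussian noise with the patch's mean and std.

    Which patch the statistics are taken from is an assumption.
    """
    return rng.normal(patch.mean(), patch.std(), size=patch.shape)

def injection_success_rate(outcomes):
    """ISR: number of successful injections divided by the total number of cases."""
    outcomes = list(outcomes)
    return sum(bool(o) for o in outcomes) / len(outcomes)

rng = np.random.default_rng(0)
source = rng.standard_normal((4, 64, 64))   # noise containing the trigger patch
target = rng.standard_normal((4, 64, 64))   # noise to be edited
blended = inject_patch(target, source, (8, 8, 16, 16), (40, 24, 16, 16))
baseline_patch = resample_like(source[:, 8:24, 8:24], rng)
print(injection_success_rate([True, False, True, True]))
```

The blended noise would then be passed to the diffusion model in place of the original initial noise; judging whether the object actually appears at the injection position requires a separate check on the generated image.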

BibTeX

@article{ban2024crystal,
      title={The Crystal Ball Hypothesis in diffusion models: Anticipating object positions from initial noise},
      author={Ban, Yuanhao and Wang, Ruochen and Zhou, Tianyi and Gong, Boqing and Hsieh, Cho-Jui and Cheng, Minhao},
      journal={arXiv preprint arXiv:2406.01970},
      year={2024}
    }