Written by Ethan Smith
While SDXL has greatly improved adherence to prompts, there are still several notable places where it falls short, some of which I think are somewhat inherent to using CLIP as the text condition.
Here are a couple of failure cases I’ll roughly formalize:
You’ve probably seen it happen: you ask for “A cat with green eyes resting upon a pair of shoes” and the green bleeds into other parts of the scene.
“Photo of a red sphere on top of a blue cube. Behind them is a green triangle, on the right is a dog, on the left is a cat”