Prompt engineering is hardRead time in minutes: 11
I've seen a lot of comments on Twitter that seem to completely misunderstand the process of getting a decent result with AI generators like Stable Diffusion and DALL-E 2. People seem to assume that it's just "push button, recieve bacon" without any real creativity in the equation. As someone who has done a lot of this experimentation in the past few months, I'd like to challenge that assertion and show you what the process for getting a decent result actually involves.
First, you need to start off with a vision for what you want. I'm going to pull my fictional world Malto, specifically an area named Kanar. It is a very green area, lots of bamboo and the local architecture takes advantage of it. The area is fairly wealthy because they take advantage of their weird soil composition in order to produce the plants that help them make an alcoholic beverage that the nobles all over the world can't get enough of.
From here, I like to get the "vibe" of the image down first. I think that Waifu Diffusion would be a good model to use for this, mostly because you can feed it danbooru style tags to prompt the image you want. I also kind of want a studio ghibli feeling, and Waifu Diffusion has proven to be very good at that.
My first iterations start out with a few 512x512 images with a few vibe prompt keywords to get basic thoughts onto the "canvas".
Here is my starting prompt:
bamboo bamboo_forest grass studio_ghibli hayao_miyazaki happy peaceful summer
I'm not a big fan of these. I want a landscape, but it instead showed me a
bamboo forest from the inside. There were also some subjects in frame in a few
of those too. We're not focusing on subjects yet. Let's remove the
bamboo_forest tag and add the
bamboo landscape pagoda grass studio_ghibli hayao_miyazaki happy peaceful summer
This is a lot closer to what I want. I'm going to stick with this seed too,
320353. Now that we have a seed that's better, let's increase the resolution
to 1280x512 to see how that changes things. The AI natively draws images in
512x512 chunks, so the jump to something larger can be weird at times.
That came out way better than expected! Usually doing the jump to 1280x512 causes some cursed results and mutant hell creatures. This is actually somewhat decent, but I want something a bit more stylized. This is roughly correlating to the image I have of Kanar in my head, but this looks closer to a drawing of the area done by an outsider rather than how they would portray themselves. I'm going to add a few style keywords:
bamboo landscape pagoda grass studio_ghibli hayao_miyazaki happy peaceful summer ukiyo-e wood-block
Something fun you can do is add "masterpiece" to make the image look better and "unreal engine" to improve the lighting. Let's mess with that here:
bamboo landscape pagoda grass studio_ghibli hayao_miyazaki happy peaceful summer ukiyo-e wood-block unreal engine masterpiece very beautiful
You know what, I'm not feeling that wood-block style. Let's remove it in the
next round. The main thing we need to focus on next is the subject. The main
export of Kanar is a rice-based alcoholic drink. I'm going to add "rice_paddy"
to the prompt just after
bamboo landscape pagoda grass rice_paddy studio_ghibli hayao_miyazaki happy peaceful summer ukiyo-e unreal engine masterpiece very beautiful
I think that last one is going to be the image that I'm going with. Let's see what happens if we change the time of day:
...well that didn't change the time of day at all. I think I like the nighttime result though, so I'm going to go with that. Here's the image we have so far:
This has a bit of an imposing feeling, maybe like a castle or some other important building. Maybe this is the private pagoda of their leader. What if we add some guards?
nighttime bamboo landscape pagoda grass rice_paddy studio_ghibli hayao_miyazaki happy peaceful summer ukiyo-e unreal engine masterpiece very beautiful guards pikemen
I really like this. This is the kind of vibe that I'm going for. I want something that makes me feel like I'm looking into that area that I've only ever seen in my mind's eye from description paragraphs and topological charts. This is the kind of thing that Stable Diffusion and similar models let you do as a writer: they let you bring images out of your head and onto the canvas so that you can have people really understand what it's like. If I wrote a longer story set in here, I'd probably throw this image and a few others generated with different seeds to an artist to help me make an image for a book cover.
I'm also not really sure why people call this "prompt engineering", I'd personally rather call it "scrying", but I can understand why Silicon Valley culture would push everything towards being "engineering". I just legally can't call myself an "engineer" in Canada without an engineering degree.
This is the kind of technology I am really excited for, and I can't wait to see how this evolves. Computers are fun sometimes.