Image Generation with ComfyUI and Stable Diffusion
About This Presentation
Discover the power of local image generation through a gradual use case to generate yoga poses and learn everything about Stable Diffusion, text-to-image, image-to-image, embedding, LoRA and ControlNet!
Slide Content
Get Comfy and Dream!
Controlled Image Generation with Stable Diffusion
Hajer Mabrouk
•Former Innovation Manager at Oracle
•Certified Somatic Coach and Yoga Teacher
•linkedin.com/in/hajer-mabrouk/
Raphaël Semeteys
•Head of DevRel, Senior Architect at Worldline
•Certified Yoga Teacher
•raphiki.github.io
Use Case
•YogĀrkana is a website dedicated to yoga
•Descriptions and images of yoga poses must be precise
•Photography is not always the best option
•Images or photos from the internet cannot be reused
Goal: locally generate accurate images of yoga poses
How could Generative AI help?
Stable Diffusion
From German Labs to London-based Startup
•Collaboration of several companies and German labs
•Latent Diffusion Model with embedding space in 2021
•CLIP-guided diffusion
•LAION dataset
•Runway and EleutherAI participation
•Stability AI
•Donated compute to the project
•Hired most of the initial researchers
•Now the official maintainer of Stable Diffusion models
“Open” Licenses
•Responsible AI: OpenRAIL
•Version 3.5: enterprises with $1M+ annual revenue must pay
Stable Diffusion
Very dynamic contributing Communities
Models
•Fine-tuning: custom models for specific styles or themes
•Refiners, Upscalers, ControlNets
•Model extensions (LoRA)
Tools
•User-friendly interfaces: Automatic1111 Web UI, ComfyUI
•Fine-tuning tools: DreamBooth, Kohya SS
Sharing Communities
•Portals to share models, prompts, images and tutorials: Hugging Face, Civit.ai…
•Stable Horde: crowdsourced distributed cluster of generation workers
ComfyUI
GUI for local Stable Diffusion workflows
•Intuitive, modular and customizable
•Flexible node-based workflows
•Text-to-Image Generation
•Image-to-Image Processing
•Custom Node Management
•Community-Driven
•GPL 3 License
•Contributions, plugins, doc
Let’s start with a simple demo
Generate an image of a girl doing a yoga pose
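The demo itself is built in ComfyUI's node graph; as a rough script-level equivalent, here is a minimal text-to-image sketch using Hugging Face's diffusers library (the model id, prompt and parameter values are illustrative assumptions, not the deck's exact setup):

```python
# Minimal text-to-image sketch with diffusers (assumed setup, not the
# ComfyUI graph from the demo). Requires a CUDA GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # an SD 1.5 checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="a girl doing a yoga pose, photorealistic",
    num_inference_steps=25,   # number of denoising steps
    guidance_scale=7.5,       # CFG: adherence to the prompt
).images[0]
image.save("yoga_pose.png")
```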
What does Stable Diffusion do?
•Starts with Random Noise: begins with a noisy, unrecognizable image
•Refines Step-by-Step: gradually removes noise, adding details
•Learns from Real Images: uses patterns from trained images
•Text-Guided Creation: follows prompts like "sunset over mountains"
•Denoising Process: clarifies the image layer by layer
•Final Image Output: produces a clear, detailed image matching the prompt
Source: Wikipedia
Models
•Most used Stability.ai Models
•SD 1.5
•SDXL
•Fine-tuned Models
•Specialized: style, subject
•Shared by communities (like civitai.com)
Prompts
•CLIP Model (Contrastive Language–Image Pretraining)
•Connect descriptive text and images
•Help generate images matching specific prompts
•Can handle a wide range of prompts
•Developed by OpenAI in 2021
•Usable under the MIT license
•Trained on 400M image-text pairs from the Internet
•Positive & negative prompts
•Plain text or short weighting syntax (see the sketch below)
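As a sketch of how positive and negative prompts combine (reusing the hypothetical `pipe` from the earlier example; note that the `(word:1.2)` short weighting syntax is a UI convention of A1111/ComfyUI, not part of the raw diffusers call):

```python
# Positive + negative prompting sketch; prompts are illustrative.
image = pipe(
    prompt="a girl doing tree pose, studio lighting, highly detailed",
    negative_prompt="blurry, extra limbs, deformed hands, watermark",
    guidance_scale=7.5,
).images[0]
```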
Embeddings (Textual Inversions)
•Vector representations of text
•“Instructions” for image generation
•Style, theme, texture, pose, character features, etc.
•Small files containing additional concepts
•To be injected in prompts
•Community provides many presets
•Must be aligned with the Stable Diffusion version
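For illustration, loading a community embedding in script form might look like this (a diffusers sketch; the repo and its `<cat-toy>` trigger token come from the diffusers documentation examples, not from this deck):

```python
# Textual-inversion sketch: the embedding adds a new token the prompt
# can reference; it must match the SD version it was trained for.
pipe.load_textual_inversion("sd-concepts-library/cat-toy")
image = pipe("a girl doing lotus pose in <cat-toy> style").images[0]
```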
Embeddings (Textual Inversions)
Example images: No Embedding · Ghibli · Fantasy · Comic · 3D Render · Analog Film · Cinematic · Cyberpunk · Digital Art · Vector Art
Latent Space
•Latent Space
•Abstract, compressed representation of the image
•Handles encoded features such as shapes, colors, textures and general structure
•Manipulation of embedding vectors
•Iterative and refining generation
•Random noise is introduced into the latent space
•At each step the model adjusts the features to match the prompt
•VAE (Variational Autoencoder)
•Converts Image Pixels ↔ Latent Space (encode and decode)
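A rough sketch of that VAE round trip (encode into the latent space, decode back), using the hypothetical `pipe` from before; the random tensor stands in for a real normalized image:

```python
# VAE round-trip sketch: pixels -> latents -> pixels (SD 1.5 shapes).
import torch

img = torch.randn(1, 3, 512, 512, dtype=torch.float16, device="cuda")  # stand-in image in [-1, 1]
with torch.no_grad():
    latents = pipe.vae.encode(img).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor   # shape 1x4x64x64: 8x smaller per side
    decoded = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
```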
Denoising Process
•Seed
•Random seed used to create initial noise
•Fixing it lets you see the impact of other parameters
•Samplers
•Algorithms guiding the iterative image generation
•Differ in Speed and Quality
•Schedulers
•Control how noise is removed at each step
•Also impact speed and quality; Karras is well balanced
•Other Parameters
•#steps, CFG (adherence to prompt), %denoising
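In script form, fixing the seed and choosing a sampler/scheduler might look like this (a diffusers sketch; the Karras-sigma DPM++ scheduler is one common balanced choice, and all values are assumptions):

```python
# Seed + scheduler sketch: a fixed generator makes runs reproducible.
import torch
from diffusers import DPMSolverMultistepScheduler

pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True  # Karras noise schedule
)
generator = torch.Generator("cuda").manual_seed(42)  # fixed seed

image = pipe(
    "a girl doing camel pose",
    generator=generator,
    num_inference_steps=30,  # #steps
    guidance_scale=7.0,      # CFG
).images[0]
```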
I can tweak generation, but I don’t control the pose…
Example images: Camel pose · Tree pose · Lotus pose · Shoulder Stand pose
Text-to-Image generation is not enough!
Let’s move on to Image-to-Image
Generate an image of a girl doing a yoga pose based on an existing image
Image-to-Image Generation
•Input Image
•Replace the Empty Latent Image with a real one
•Needs a VAE Encode (from the model)
•Play with % denoising (example images: denoise 0.55 vs 0.70)
•Prompt has less impact
•Increasing CFG only reduces quality (example images: CFG 20 vs CFG 8)
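As a script-level sketch of the same idea, diffusers' img2img pipeline replaces the empty latent with an encoded input image, and `strength` plays the role of the % denoising knob (file names and values are assumptions):

```python
# Image-to-image sketch: start from a real photo instead of pure noise.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe_i2i = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init = Image.open("reference_pose.png").convert("RGB").resize((512, 512))
image = pipe_i2i(
    prompt="a girl doing shoulder stand pose",
    image=init,
    strength=0.55,       # % denoising: lower stays closer to the input
    guidance_scale=8.0,  # pushing CFG much higher mostly degrades quality
).images[0]
```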
ControlNets
•Specialized Neural Networks
•Additional control and guidance to primary model
•Use reference images to transfer structural information or inject features
→ Hybrid approach with both text and visual references
•Control methods
•Structural: pose, edge detection, segmentation, depth
•Texture & Detail: scribble/sketch, stylization from edges
•Content & Layout: bounding boxes, inpainting masks
•Abstract & Style: color maps, textural fields
Example image: Depth ControlNet
Preprocessors for ControlNets
Example images: Initial Image · Line Art · Color Map · Open Pose · Segmentation · Depth Map · Scribble · Straight Lines
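A hedged sketch chaining a pose preprocessor and an OpenPose ControlNet in diffusers (the `controlnet_aux` package and the model ids are common community choices, assumed here rather than taken from the deck):

```python
# Preprocessor + ControlNet sketch: extract a pose skeleton from a
# reference image, then use it as structural guidance.
import torch
from PIL import Image
from controlnet_aux import OpenposeDetector
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
pose_map = openpose(Image.open("reference_pose.png"))  # skeleton image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe_cn = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe_cn(
    "a girl doing tree pose, studio photo",
    image=pose_map,  # structural guidance from the extracted pose
).images[0]
```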
More abstract input images
•Design poses in 3D with image export
•Use of JustSketchMe tool (webapp & PWA)
•Design poses based on my own knowledge
•Several angles of view
•(waiting for 3D GenAI Models)
How can I achieve greater consistency for the character?
Create images featuring the same facial identity
LoRA
•Low-Rank Adaptation
•Lightweight Model Adaptation
•Update a small subset of model parameters
•Very efficient
•Small File Size, use significantly less memory
•Faster Training
•Usage
•Specific styles, poses, characters, or concepts
•Triggered by keywords in the prompt
•Many LoRAs are provided by the community
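Attaching a community LoRA in script form could look like this sketch (the repo name, trigger word and scale are hypothetical):

```python
# LoRA sketch: small adapter weights layered on the base model,
# triggered by a keyword in the prompt.
pipe.load_lora_weights("some-user/yoga-character-lora")  # hypothetical repo
image = pipe(
    "photo of sks woman doing warrior pose",  # 'sks' = assumed trigger word
    cross_attention_kwargs={"scale": 0.8},    # LoRA strength
).images[0]
```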
Examples of generated images
Controlled generation for Diversity & Inclusion
Conclusion
•Image Generation is both Science and Art
•A lot of parameters to tune
•Additional inputs and components to control generation
•Our use case is implementable
•Nicer and homogeneous images for YogĀrkana
•Cherry on the cake: a more inclusive website!
•Next steps
•Create our own LoRA, test video generation
•Explore voice generation for i18n and more inclusivity