240603_Thanh_LabSeminar[Imagine That! Abstract-to-Intricate Text-to-Image Synthesis with Scene Graph Hallucination Diffusion].pptx


About This Presentation

Imagine That! Abstract-to-Intricate Text-to-Image Synthesis with Scene Graph Hallucination Diffusion


Slide Content

Imagine That! Abstract-to-Intricate Text-to-Image Synthesis with Scene Graph Hallucination Diffusion. Tien-Bach-Thanh Do, Network Science Lab, Dept. of Artificial Intelligence, The Catholic University of Korea. E-mail: osfa19730@catholic.ac.kr. 2024/06/03. Abhra Chaudhuri et al., NeurIPS 2023.

Introduction Figure 1: (a) An example of abstract-to-intricate T2I synthesis. All images are generated by the Latent Diffusion Model (LDM) [42]. LDM fails to accurately render the abstract contexts of the original prompt, e.g., 'give a presentation' and 'office'. Raw prompts can be enriched via descriptive insertion [8] or addition [48]; enriched contexts are in blue. (b) We illustrate the human intuition behind the abstract-to-intricate T2I process: we first grasp the semantic structure of the original prompt text, i.e., its scene graph (SG), and then carry out imagination of a more complete scene based on the SG. Here the glowing nodes and edges are the enriched ones.

Methodology Figure 2: Overall framework of our proposed SG-based hallucination diffusion system (Salad)

Methodology Scene Graph Representation. Object nodes: {o_1, ..., o_N}, where o_n denotes the n-th object node. Attribute nodes: {a_{1,1}, ..., a_{N,M}}, where a_{n,m} denotes the m-th attribute node of the n-th object node. Relation nodes: {r_{1,1}, ..., r_{N,N}}, where r_{i,j} denotes the relation connecting object node o_i to object node o_j. All nodes come with a category label l.
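To make the representation concrete, here is a minimal Python sketch of such a scene graph; the class and field names are illustrative, not the paper's actual data format.

from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    label: str                                       # category label l, e.g. "man"
    attributes: list = field(default_factory=list)   # attribute labels a_{n,m}, e.g. ["young"]

@dataclass
class SceneGraph:
    objects: list                                    # object nodes o_1 ... o_N
    relations: dict = field(default_factory=dict)    # (i, j) -> relation label r_{i,j}

# e.g. the abstract prompt "a man gives a presentation in an office"
sg = SceneGraph(
    objects=[ObjectNode("man"), ObjectNode("presentation"), ObjectNode("office")],
    relations={(0, 1): "give", (0, 2): "in"},
)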

Methodology Diffusion Models. DMs learn to convert a Gaussian distribution into the data distribution. DMs consist of a forward (diffusion) process and a reverse (denoising) process. Given data x_0 ~ q(x_0), the sample is gradually corrupted into an approximately standard normal distribution x_T ~ p(x_T) over T steps by incrementally adding noise. The learned reverse process then denoises x_T back to x_0 step by step. To improve the fit of the generative model to the data distribution, a variational upper bound on the negative log-likelihood is optimized, where p_θ(·) is estimated by a denoising network, e.g., a time-conditional U-Net.
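For reference, a minimal sketch of the standard DDPM formulation these statements refer to (the slide's own equations did not survive extraction); the noise schedule is denoted β_t:

q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big), \qquad q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)

\mathcal{L}_{\mathrm{vlb}} = \mathbb{E}_q\Big[-\log p_\theta(x_0 \mid x_1) + \sum_{t>1} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) + D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)\Big]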

Methodology Scene Graph Hallucination. SGH is cast as a discrete denoising diffusion process. The SG of the gold image, G_0, is corrupted into a sequence of increasingly noisy latent variables G_{1:T} = {G_1, G_2, ..., G_T}, where each SG node s^{*}_{t,j} ∈ G_t, with * ∈ {o, a, r} (t is the diffusion step, j the node index), takes a discrete value among K^* category labels. The discrete diffusion process is parameterized with a multinomial categorical transition matrix, where B(s_t) denotes the column one-hot vector of s_t and Q_t is the transition matrix with [Q_t]_{ij} = q(s_t = j | s_{t-1} = i), the probability that s_{t-1} = i transitions to s_t = j. Due to the Markov-chain property, the cumulative probability of s_t at an arbitrary timestep can be derived directly from s_0. A mask-and-replace strategy is employed to design Q_t, with three probabilities: 1) a probability γ_t of transitioning to the [MASK] node, 2) a probability Kβ_t of being resampled uniformly over all K categories, and 3) the remaining probability α_t = 1 − Kβ_t − γ_t of staying at the same node. The resulting transition matrix Q_t is sketched below.
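Under the slide's index convention, the corruption and cumulative distributions, together with the mask-and-replace transition matrix, can be written as follows (this mirrors the standard VQ-Diffusion-style formulation; the paper's exact matrix may differ in detail):

q(s_t \mid s_{t-1}) = B(s_{t-1})^{\top} Q_t\, B(s_t), \qquad q(s_t \mid s_0) = B(s_0)^{\top} \bar{Q}_t\, B(s_t), \quad \bar{Q}_t = Q_1 Q_2 \cdots Q_t

Q_t =
\begin{bmatrix}
\alpha_t + \beta_t & \beta_t & \cdots & \beta_t & \gamma_t \\
\beta_t & \alpha_t + \beta_t & \cdots & \beta_t & \gamma_t \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\beta_t & \beta_t & \cdots & \alpha_t + \beta_t & \gamma_t \\
0 & 0 & \cdots & 0 & 1
\end{bmatrix},
\qquad \alpha_t = 1 - K\beta_t - \gamma_t

where the last row and column correspond to the [MASK] state (once masked, a node stays masked).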

Methodology Scene Graph Hallucination. The SG decoder serves as the neural approximator that estimates the denoising distribution. Adaptive layer normalization (AdaLN) is employed to inject the timestep information. Text cross-attention (Text-CA) integrates the input prompt y. Graph cross-attention (Graph-CA) is devised to take in the SG G_{t+1} induced at the previous timestep t+1, where H^* are the features yielded from Text-CA. A node-type-dependent cross-attention is designed for inducing each *-type node.
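A rough PyTorch sketch of one such decoder block, purely for illustration: the layer sizes, residual layout, and conditioning order are assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn

class SGDecoderBlock(nn.Module):
    """One decoder block: AdaLN for the timestep, self-attention over SG nodes,
    text cross-attention (Text-CA), and graph cross-attention (Graph-CA)."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.ada_scale_shift = nn.Linear(dim, 2 * dim)          # AdaLN: timestep -> (scale, shift)
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_ca = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.graph_ca = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, nodes, t_emb, text_tokens, prev_graph):
        # AdaLN: inject timestep information as a scale/shift of the normalized node features
        scale, shift = self.ada_scale_shift(t_emb).chunk(2, dim=-1)
        h = self.norm(nodes) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        h = h + self.self_attn(h, h, h)[0]
        h = h + self.text_ca(h, text_tokens, text_tokens)[0]    # Text-CA: condition on prompt y
        h = h + self.graph_ca(h, prev_graph, prev_graph)[0]     # Graph-CA: condition on G_{t+1}
        return h + self.ffn(h)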

Methodology Scene-driven Image Synthesis. Figure 3: Hierarchical scene integration (HSI) fuses the SG features at multiple levels: 1) objects (with attributes), 2) relational triplets (i.e., subject-predicate-object), 3) regional neighbors, and 4) the whole SG.

Methodology Scene-driven Image Synthesis. A hierarchical scene integration strategy is designed to ensure highly effective integration of the SG features, considering fusion at 4 different hierarchical levels. The representations of these levels are maintained as the keys and values via a CLIP encoder and are integrated via the Transformer cross-attention of the U-Net in the Latent Diffusion Model, where H denotes the visual query vectors from the ResNet block in the LDM. By denoising over T steps, the system finally produces the desired image.
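A minimal PyTorch sketch of this kind of cross-attention fusion, under assumed dimensions (e.g., vis_dim=320 for an early U-Net block); it is not the authors' exact layer.

import torch
import torch.nn as nn

class HierarchicalSceneIntegration(nn.Module):
    """Cross-attention that fuses multi-level SG features (objects, triplets,
    regions, whole graph) into the U-Net's visual features."""

    def __init__(self, vis_dim=320, sg_dim=512, heads=8):
        super().__init__()
        self.key_proj = nn.Linear(sg_dim, vis_dim)
        self.val_proj = nn.Linear(sg_dim, vis_dim)
        self.attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)

    def forward(self, H, sg_levels):
        # H: (B, HW, vis_dim) visual query vectors from a ResNet block of the LDM U-Net
        # sg_levels: list of (B, L_i, sg_dim) features for the 4 hierarchy levels
        kv = torch.cat(sg_levels, dim=1)                 # stack all levels along the token axis
        out, _ = self.attn(H, self.key_proj(kv), self.val_proj(kv))
        return H + out                                   # residual fusion into the U-Net stream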

Methodology Training. SGH is first updated separately via L_SGH, based on the abstract-to-intricate SG pair annotations, until it converges. The SIS and SGH modules are then optimized jointly. SIS is optimized with the surrogate objective of [23], which computes an MSE loss; G_t is the intermediate SG produced by SGH at timestep t (derived from the node states s^*_t), ε is the noise in SIS, and ε_θ(·) denotes the U-Net.
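A plausible form of these objectives, consistent with the slide's description (the conditioning arguments of ε_θ and the weighting λ between the two terms are assumptions):

\mathcal{L}_{\mathrm{SIS}} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0,\mathbf{I}),\ t}\Big[\big\|\epsilon - \epsilon_\theta(x_t,\ t,\ y,\ G_t)\big\|_2^2\Big], \qquad \mathcal{L} = \mathcal{L}_{\mathrm{SIS}} + \lambda\,\mathcal{L}_{\mathrm{SGH}}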

Methodology Inference with Scene Sampling. The aim is to endow SGH with diversified SG enrichment, leading to T2I diversification: given an abstract prompt, there is more than one possible scene to imagine. A diffusion model has larger potential for divergence at its earlier steps, while generation becomes more stable and certain as the iteration grows. A scene-sampling mechanism is therefore designed: take the top-A category candidates together with their probabilities from each node's estimated category distribution, then sample over these candidates with a dynamic probability controlled by a temperature η. When t = T (denoising starts), more random sampling is preferable; as t approaches 0 (denoising ends), SGH becomes more decisive. Figure 4: Illustration of the scene sampling mechanism.
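A toy Python sketch of such top-A sampling with a timestep-annealed temperature; the linear schedule and parameter names (top_a, eta) are assumptions, not the paper's exact formula.

import torch

def scene_sample(logits, t, T, top_a=5, eta=1.0):
    """Sample a node category from the top-A candidates with a timestep-dependent
    temperature: more exploratory near t = T, nearly greedy as t -> 0."""
    temperature = eta * t / T + 1e-8              # anneal from eta (t = T) towards 0 (t = 0)
    top_logits, top_idx = logits.topk(top_a, dim=-1)
    probs = torch.softmax(top_logits / temperature, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return top_idx.gather(-1, choice).squeeze(-1)

# e.g. a batch of node logits over K = 100 categories at timestep t = 80
node_logits = torch.randn(4, 100)                 # (num_nodes, K)
labels = scene_sample(node_logits, t=80, T=100)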

Experiments Settings. T2I generation experiments are conducted mainly on the COCO dataset. For the abstract-to-intricate SG pair annotations used to train the SGH module, an external textual SG parser and a visual SG parser are applied to the paired texts and images in COCO to obtain the initial SG and the imagined SG, respectively. To enlarge the abstract-to-intricate SG pairs, Visual Genome (VG) is additionally used. Comparison is made with 3 types of existing T2I models: GAN-based models (AttnGAN, ObjGAN, DF-GAN, OP-GAN), auto-regressive models (DALL-E, CogView), and diffusion-based models (LDM, VQ-Diffusion, LDM-G, Frido). Enriched text prompts are also utilized as inputs for Frido to generate final images. Three standard metrics measure image synthesis performance: Inception Score, Fréchet Inception Distance, and CLIP score. In addition, GLIP is used to measure fine-grained object-attribute grounding in images, Triplet Recall measures subject-predicate-object triplet matching between two SGs, and Learned Perceptual Image Patch Similarity is also reported.

Experiments Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval. The task of parsing image descriptions into scene graphs is defined as follows. Given a set of object classes C, a set of relation types R, a set of attribute types A, and a sentence S, we want to parse S into a scene graph G = (O, E). O = {o1, ..., on} is the set of objects mentioned in S, and each oi is a pair (ci, Ai) where ci ∈ C is the class of oi and Ai ⊆ A are the attributes of oi. E ⊆ O × R × O is the set of relations between two objects in the graph. For example, given the sentence S = "A man is looking at his black watch", we want to extract the two objects o1 = (man, ∅) and o2 = (watch, {black}), and the relations e1 = (o1, look at, o2) and e2 = (o1, have, o2). The sets C, R and A consist of all the classes and types present in the training data.
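The worked example above, written out as plain Python containers (purely illustrative):

# Scene graph for S = "A man is looking at his black watch"
C = {"man", "watch"}        # object classes
R = {"look at", "have"}     # relation types
A = {"black"}               # attribute types

o1 = ("man", frozenset())             # (class, attribute set)
o2 = ("watch", frozenset({"black"}))
O = [o1, o2]
E = [(o1, "look at", o2),             # e1: man -- look at --> watch
     (o1, "have", o2)]                # e2: possessive "his" yields a have relation
G = (O, E)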

Experiments Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval. We implement two parsers: a rule-based parser and a classifier-based parser. Both of our parsers operate on a linguistic representation which we refer to as a semantic graph. We obtain semantic graphs by parsing the image descriptions into dependency trees, followed by several tree transformations. In this section, we first describe these tree transformations and then explain how our two parsers translate the semantic graph into a scene graph. A Universal Dependencies parse is in many ways close to a shallow semantic representation and a good starting point for parsing image descriptions to scene graphs.

Experiments Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval. Our rule-based parser extracts objects, relations and attributes directly from the semantic graph. We define in total nine dependency patterns using Semgrex expressions, which capture a range of constructions and phenomena. With the exception of possessives, for which we manually add a have relation, all objects, relations and attributes are words from the semantic graph. For example, for the semantic graph, the subject-predicate-object pattern matches man <-nsubj- riding -dobj-> horse and man' <-nsubj- riding' -dobj-> horse'. From these matches we extract two man objects and two horse objects, and add ride relations to the two man-horse pairs. Further, the possessive pattern matches man <-nmod:poss- horse and man' <-nmod:poss- horse', and we add have relations to the two man-horse pairs. A simplified sketch of this pattern matching is given below.
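A purely illustrative Python sketch of this kind of rule-based extraction over dependency edges; it covers only the subject-predicate-object and possessive patterns, not the full set of nine Semgrex patterns.

def extract_triples(edges):
    """Toy rule-based extraction over dependency edges given as (head, relation, dependent)."""
    triples = []
    subjects = {h: d for h, rel, d in edges if rel == "nsubj"}
    objects = {h: d for h, rel, d in edges if rel == "dobj"}
    # subject-predicate-object: a verb with both an nsubj and a dobj dependent
    for verb in subjects.keys() & objects.keys():
        triples.append((subjects[verb], verb, objects[verb]))
    # possessive: an nmod:poss edge adds a "have" relation between possessor and possessee
    for h, rel, d in edges:
        if rel == "nmod:poss":
            triples.append((d, "have", h))
    return triples

# "The man is riding his horse"
deps = [("riding", "nsubj", "man"), ("riding", "dobj", "horse"), ("horse", "nmod:poss", "man")]
print(extract_triples(deps))   # [('man', 'riding', 'horse'), ('man', 'have', 'horse')]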

Experiments Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval Our classifier-based parser consists of two components. First, we extract all candidate objects and attributes, and second we predict relations between objects and the attributes of all objects.

Experiments Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval. In a first step we extract all nouns, all adjectives and all intransitive verbs from the semantic graph. As this does not guarantee that the extracted objects and attributes belong to known object classes or attribute types, and as our image retrieval model can only make use of known classes and types, we predict for each noun the most likely object class and for each adjective and intransitive verb the most likely attribute type. To predict classes and types, we use an L2-regularized maximum entropy classifier which uses the original word, the lemma and the 100-dimensional GloVe word vector (Pennington et al., 2014) as features.

Experiments Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval. The last step of the parsing pipeline is to determine the attributes of each object and the relations between objects. We treat both of these tasks as a pairwise classification task. For each pair (x1, x2), where x1 is an object and x2 is an object or an attribute, we predict the relation y, which can be any relation seen in the training data or one of the two special relations IS and NONE, indicating that x2 is an attribute of x1 or that no relation exists, respectively. We noticed that for most pairs for which a relation exists, x1 and x2 are in the same constituent, i.e., their lowest common ancestor is either one of the two objects or a word in between them. We therefore consider only pairs which satisfy this constraint, to improve precision and to limit the number of predictions. For the predictions, we again use an L2-regularized maximum entropy classifier with the following features. Object features: the original word and lemma, and the predicted class or type of x1 and x2. Lexicalized features: the word and lemma of each token between x1 and x2; if x1 or x2 appears more than once in the sentence because it replaces a pronoun, we only consider the words in between the closest mentions of x1 and x2. Syntactic features: the concatenated labels (i.e., syntactic relation names) of the edges on the shortest path from x1 to x2 in the semantic graph. A toy sketch of such a pairwise classifier follows.
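A toy Python sketch of the pairwise classifier using scikit-learn; the feature dictionaries are hand-picked stand-ins for the object, lexicalized and syntactic features described above, and the training pairs are invented for illustration only.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy pairwise examples: features of an (x1, x2) pair -> relation label
# ("IS" marks an attribute, "NONE" marks no relation).
pairs = [
    ({"x1_lemma": "man", "x2_lemma": "horse", "between": "ride", "path": "nsubj-dobj"}, "ride"),
    ({"x1_lemma": "watch", "x2_lemma": "black", "between": "", "path": "amod"}, "IS"),
    ({"x1_lemma": "man", "x2_lemma": "office", "between": "in", "path": "nmod:in"}, "in"),
    ({"x1_lemma": "horse", "x2_lemma": "office", "between": "", "path": ""}, "NONE"),
]
X, y = zip(*pairs)

# An L2-regularized maximum entropy classifier is multinomial logistic regression
clf = make_pipeline(DictVectorizer(), LogisticRegression(penalty="l2", C=1.0, max_iter=1000))
clf.fit(X, y)
print(clf.predict([{"x1_lemma": "man", "x2_lemma": "watch", "between": "look at", "path": "nsubj-dobj"}]))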

Experiments Neural Motifs: Scene Graph Parsing with Global Context

Experiments Neural Motifs: Scene Graph Parsing with Global Context. A scene graph G is a structured representation of the semantic content of an image. It consists of: a set B = {b1, ..., bn} of bounding boxes, bi ∈ R^4; a corresponding set O = {o1, ..., on} of objects, assigning a class label oi ∈ C to each bi; and a set R = {r1, ..., rm} of binary relationships between those objects. Each relationship rk ∈ R is a triplet of a start node (bi, oi) ∈ B × O, an end node (bj, oj) ∈ B × O, and a relationship label xi→j ∈ R, where R is the set of all predicate types, including the "background" predicate BG, which indicates that there is no edge between the specified objects.


Experiments Settings. The maximum number of SG object nodes is 30, each object node has a maximum of 3 attributes, and the number of timesteps is 100. The SIS module loads the parameters of Stable Diffusion as initialization and uses CLIP as the text encoder. Optimization uses AdamW with β1 = 0.9 and β2 = 0.98, and a learning rate of 5e-5 after 10,000 iterations of warm-up. The attention layers in the SG decoder and the U-Net in SIS use 4 layers, 8 attention heads, 512 embedding dimensions, 2948 hidden dimensions, and a 0.1 dropout rate. Following prior works, the visual scene graph is acquired from the gold image and the textual scene graph from the text prompt.
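A minimal PyTorch sketch of the stated optimizer settings; the parameter group is a placeholder and the linear warm-up schedule is an assumption, since the slide does not specify the exact schedule.

import torch

# Hypothetical parameter group standing in for the SG decoder + SIS U-Net weights
params = [torch.nn.Parameter(torch.randn(512, 512))]

# AdamW with the betas and learning rate reported above
optimizer = torch.optim.AdamW(params, lr=5e-5, betas=(0.9, 0.98))

# Linear warm-up over the first 10,000 iterations
warmup_steps = 10_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)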

Experiments Settings

Experiments Settings. Figure 5: Qualitative results by different methods, where the given prompts, randomly selected from the test set, are short and abstract expressions (marked in red) that nevertheless describe intricate visual scenes.

Conclusion. Explore the text-to-image synthesis task under the abstract-to-intricate setup. Propose a scene-graph hallucination mechanism that carries out scene imagination based on the initial scene graph of the input prompt, expanding the starting SG with more specific possible scene structures. Develop an SG-based hallucination diffusion system for abstract-to-intricate T2I, which mainly includes an SG-guided T2I module and an SGH module. Design the SGH module based on the discrete diffusion technique; it evolves the initial SG structure by iteratively adding new scene elements. Utilize another continuous diffusion model as the T2I synthesizer, where the overall image-generating process is navigated by the underlying semantic scene structure induced by the SGH module. The SG-based hallucination mechanism is able to generate logically sound SG structures, which in return helps produce high-quality, scene-enriched images.