In this blog post, we introduce three ways of integrating visual information into LLMs and provide a detailed analysis and comparison of them.
This post surveys the three existing types of approaches for integrating the vision modality into LLMs, and additionally presents both qualitative and quantitative analyses of the three methods.
Recent papers have shown that LLMs can be successfully adapted to vision-language tasks with only a small amount of fine-tuning. Various methods have been proposed for bridging the visual and text modalities; nonetheless, comparisons and analyses among them are rarely studied. To this end, this blog post gives a gentle introduction to and summary of recent approaches for vision-language adaptation of LLMs. Although there have been many related works in recent years, they basically fall into three categories: query-based, projection-based, and parameter-efficient tuning approaches (following the definitions in
In this blog post, we first give a detailed introduction to each representative work individually. We then present our own qualitative and quantitative analyses and draw conclusions from the results.
Categories | Query-based | Projection-based | Parameter-Efficient Tuning |
---|---|---|---|
Selected model | InstructBLIP | LLaVA | LLaMA-Adapter |
Are the extracted image features conditioned on text? | ✔ | ❌ | ❌ |
Where are the two modalities concatenated? | At the input of the LLM | At the input of the LLM | At adapters in multiple layers of the LLM |
Is the model pretrained on image-caption dataset? | ✔ | ✔ | ❌ |
Trainable Params | 188M | 7B | 1.2M |
ScienceQA Results (Acc %) | 79.5 | 90.9 | 85.2 |
The first two rows of the table (“Are the extracted image features conditioned on text?” and “Where are the two modalities concatenated?”) capture the main concepts that distinguish the three categories.
In the following subsection, we will introduce the details about each selected model.
In this section, we choose BLIP-2 and its instruction-tuned extension InstructBLIP as the representative query-based approach.
Q-Former is a trainable module that bridges the gap between a frozen image encoder and a frozen LLM. It consists of two transformer submodules, (1) an image transformer and (2) a text transformer, which share the same self-attention layers. A set of learnable queries is fed as input to the image transformer and interacts with the frozen image features through cross-attention layers. Q-Former is pre-trained in two stages as follows:
Extract visual representations that are most informative of the text.
In the representation learning stage, Q-Former is connected to a frozen image encoder and pre-trained on image-text pairs. The goal is to enable the queries to extract the visual representations that are most relevant to the text. There are three learning objectives, each employing a different attention-masking strategy between the queries and the text.
Contrastive learning on the outputs of the image and text transformers.
It learns to align image and text representations by contrasting the image-text similarity of positive pairs against that of negative pairs. Specifically, it aligns the output query representations Z with the output [CLS] token of the text transformer: the pair-wise similarity between each query output and the [CLS] token is computed, and the highest one is selected as the image-text similarity.
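To make this matching rule concrete, below is a minimal PyTorch sketch of one direction of the contrastive objective (the shapes, batch size, and temperature are illustrative assumptions, not the official BLIP-2 implementation):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: Z holds the 32 query outputs per image, t_cls the text [CLS] embedding.
Z = torch.randn(4, 32, 256)                    # (batch, num_queries, proj_dim)
t_cls = torch.randn(4, 256)                    # (batch, proj_dim)
Z, t_cls = F.normalize(Z, dim=-1), F.normalize(t_cls, dim=-1)

# Similarity of every image's queries to every text's [CLS]: (B_img, B_txt, num_queries).
sim = torch.einsum('iqd,td->itq', Z, t_cls)
image_text_sim = sim.max(dim=-1).values        # keep the best-matching query per image-text pair
labels = torch.arange(4)                       # matched pairs lie on the diagonal
itc_loss = F.cross_entropy(image_text_sim / 0.07, labels)  # 0.07 is an illustrative temperature
```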
Given the image as a condition, generate the text.
It learns to generate text from the image features extracted by the queries, so the queries are forced to extract visual features that capture all the information about the text. Here it employs a causal self-attention mask in which the queries cannot attend to the text tokens, while each text token can attend to all queries and to the previous text tokens.
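As a concrete illustration of this masking strategy, the sketch below builds such a mask (True marks blocked attention; the sizes are illustrative):

```python
import torch

num_queries, num_text = 4, 6                   # small illustrative sizes
L = num_queries + num_text
mask = torch.zeros(L, L, dtype=torch.bool)     # True = attention is blocked

# Queries may attend to each other but not to any text token.
mask[:num_queries, num_queries:] = True
# Text tokens attend to all queries and only to current/previous text tokens (causal).
text_causal = torch.triu(torch.ones(num_text, num_text, dtype=torch.bool), diagonal=1)
mask[num_queries:, num_queries:] = text_causal
```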
Binary classification of whether an image-text pair is matched.
It learns fine-grained alignment between image and text. It uses bi-directional attention in which all queries and text tokens can attend to each other; the output query embeddings, which now carry multimodal information, are fed into a linear classifier to predict whether the image-text pair is matched (positive) or unmatched (negative).
In the generative pre-training stage, Q-Former (with the frozen image encoder attached) is connected to a frozen LLM. The output query embeddings are projected to the same dimension as the text embeddings of the LLM, and the projected query embeddings are prepended to the input text embeddings as a ‘soft prompt’. Two types of LLMs are considered, decoder-based and encoder-decoder-based, as the figure below shows.
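A minimal sketch of this soft-prompt construction is shown below (the dimensions, 768 for Q-Former and 4096 for the LLM, and the plain linear projection are assumptions for illustration):

```python
import torch
import torch.nn as nn

proj = nn.Linear(768, 4096)                  # projects query outputs into the LLM embedding space

query_out = torch.randn(2, 32, 768)          # (B, num_queries, qformer_dim) from Q-Former
text_emb = torch.randn(2, 20, 4096)          # embedded input text tokens of the frozen LLM

soft_prompt = proj(query_out)                # (B, 32, 4096)
llm_inputs = torch.cat([soft_prompt, text_emb], dim=1)  # visual 'soft prompt' prepended to text
```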
In the InstructBLIP paper, the authors conduct a systematic and comprehensive study of vision-language instruction tuning based on the pretrained BLIP-2 models.
In this section, we choose LLaVA and its follow-up LLaVA-1.5 as the representative projection-based approach.
The authors use the language-only GPT-4 model to generate visual instruction-tuning data for (image, text) pairs by prompting GPT-4 with textual features of the image, such as a description of the scene and objects in the image and the bounding-box data of those objects. The generation is seeded with a few manually curated examples. Three types of instruction-following data are collected:
A conversation between an assistant and a person asking questions about a given photo. The answers are written in a tone as if the assistant were seeing the image and answering the questions. The questions cover object types, object counts, object actions, and relative positions between objects.
A list of questions is curated that prompts GPT-4 to describe the image in more detail. The questions include:
- Describe the following image in detail
- Provide a detailed description of the given image
- Analyze the image in a comprehensive and detailed manner
This type of question requires answers that involve a step-by-step reasoning process following rigorous logic.
The LLaVA model relies on off-the-shelf pretrained vision and language models, each of which maps its input to a separate high-dimensional space. To effectively leverage their capabilities (i.e., to jointly use the information captured by the vision and language embeddings), the embeddings have to be brought closer together in the same high-dimensional space. For an input image \(X_v\), a pre-trained CLIP visual encoder (ViT-L/14) is used to extract the visual features \(Z_v = g(X_v)\); the experiments use the grid features before and after the last Transformer layer. To connect the extracted visual features to the language model and provide conditioning for the text, the features are mapped into the space of the text token embeddings via a single linear layer. Specifically, \(Z_v\) is converted to \(H_v\) via a learnable matrix \(W\), i.e., \(H_v = W \cdot Z_v\).
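A minimal sketch of this projection, assuming 1024-dimensional CLIP grid features and a 4096-dimensional LLM embedding space (both dimensions are illustrative), could look like:

```python
import torch
import torch.nn as nn

W = nn.Linear(1024, 4096, bias=False)   # the single trainable projection layer

Z_v = torch.randn(1, 256, 1024)         # (B, num_patches, vision_dim), frozen CLIP grid features
H_v = W(Z_v)                            # (B, num_patches, 4096): visual tokens fed to the LLM
```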
The data used to instruction-tune the model is organized as follows: for each image \(X_v\), multi-turn conversation data is generated as a sequence \((X_q^1, X_a^1, \dots, X_q^T, X_a^T)\), where \(T\) is the total number of turns. At turn \(t\), the model is given \(X_{instruct}^t\), defined as:
\(X_{instruct}^t = \begin{cases} \text{Random choice } [X_v, X_q^1] \text{ or } [X_q^1, X_v] & t=1 \\ X_q^t & t > 1 \\ \end{cases}\)
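In code, this turn-construction rule might be sketched as follows (the helper name and arguments are hypothetical):

```python
import random

def build_instruction(turn_idx, image_tokens, question):
    # Hypothetical helper illustrating the rule above, not LLaVA's actual code.
    if turn_idx == 1:
        pair = [image_tokens, question]
        random.shuffle(pair)      # randomly place the image before or after the first question
        return pair
    return [question]             # later turns use only the question
```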
Instruction tuning is performed on the LLM over the prediction tokens using the original auto-regressive training objective; that is, the model is fine-tuned with the loss computed only on the predicted answer tokens. The model is trained in two steps:
The experimental analysis showed that LLaVA achieved SOTA mean accuracy on ScienceQA compared to other methods such as Chain-of-Thought (CoT) prompting and LLaMA-Adapter.
Contributions of LLaVA-1.5
The base architecture of LLaVA is kept intact but the following modifications are made:
The authors note that LLaVA falls short on academic benchmarks that typically require short-form answers, and they attribute this to the fact that LLaVA is not pretrained on large-scale data as other approaches are. The following figure studies the scaling effect of data, model size, and image resolution on a selection of three datasets given in the following table.
To control the length of LLaVA’s answers, the authors explicitly state the desired response format in the prompt during the fine-tuning stage, which helps the model learn to control the length of its output. In addition, the model is fine-tuned on academic-task-oriented VQA datasets such as open-knowledge VQA, OCR VQA, and region-level VQA. Moreover, a two-layer MLP is used to project the visual features into the text token embedding space, the input image resolution is scaled up to $336$ px, and the LLM is scaled up to $13$B parameters.
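For illustration, the projector change amounts to something like the sketch below (the 1024 and 5120 dimensions are assumptions for a CLIP encoder and a 13B LLM, not taken from the released code):

```python
import torch.nn as nn

# The single linear projection of LLaVA is replaced by a two-layer MLP with GELU.
mlp_projector = nn.Sequential(
    nn.Linear(1024, 5120),
    nn.GELU(),
    nn.Linear(5120, 5120),
)
```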
With these additions, evaluation on a total of $12$ academic VQA benchmarks specifically proposed for instruction-following LMMs showed that LLaVA-1.5 achieves SOTA performance on $11$ out of the $12$ benchmarks.
In this section, we choose LLaMA-Adapter as the representative parameter-efficient tuning approach.
In LLaMA-Adapter, the authors propose a new training technique called zero-init gated attention. Fine-tuning with adapters is often unstable in the early stage of training because the pretrained model has not yet learned how to utilize the newly injected adapter modules. To address this, the authors introduce a gating factor \(\mathbf{g}_l\) on the adapter part of the attention scores before they are multiplied with the values in the attention layer. Their ablation studies further substantiate the advantage of this design.
The contributions of LLaMA-Adapter can be summarized as follows:
To mitigate the early-stage disturbance introduced by the adaption prompts, the authors add a learnable gating factor on the attention scores at the adaption prompts’ positions. Specifically, let’s take a closer look at the attention layer of LLaMA.
Suppose we have \(K\) adaption prompts prepended to the original sequence (of length \(M+1\)), and let \(C\) denote the hidden dimension of the model. Now consider the attention calculation for the last timestep: \(\mathbf{Q}_l\) is the query vector of the last timestep, while \(\mathbf{K}_l\) and \(\mathbf{V}_l\) are the key and value vectors of the entire input sequence (of length \(K+M+1\)).
To calculate the attention scores \(\mathbf{S}_l\) of the last-timestep query against all keys, we simply take the dot product of \(\mathbf{Q}_l\) with \(\mathbf{K}_l\) and normalize it by \(\sqrt{C}\), i.e., \(\mathbf{S}_l = \mathbf{Q}_l \mathbf{K}_l^\top / \sqrt{C}\).
Notice that the upper part (the first \(K\) rows) of the attention scores \(\mathbf{S}_l\) is affected by the adaption prompts \([\mathbf{P}_1, \dots, \mathbf{P}_K]\), while the rest is not. To let the model gradually learn to utilize the adaption prompts, the authors multiply the upper part by a learnable gating factor \(\mathbf{g}_l\). The softmax is applied separately to the upper and lower parts of \(\mathbf{S}_l\) rather than to the whole vector, because we do not want the values in the lower part (the original attention) to be affected by the values in the upper part (the adaption prompts).
Finally, the modified attention scores are used to compute a weighted sum over the entire sequence of value vectors, producing the final hidden state for the last timestep. In summary, if the gating factor is 0, the computation reduces to ordinary attention. The authors initialize the gating factor to 0, which is why the method is dubbed zero-init attention.
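Putting the pieces together, here is a minimal single-head sketch of this computation for the last timestep (the shapes, the absence of projection matrices and multiple heads, and the variable names are simplifications for illustration, not the official implementation):

```python
import torch

def zero_init_gated_attention(q_last, keys, values, gate, num_prompt):
    # q_last: (C,) query of the last timestep; keys/values: (K + M + 1, C).
    C = q_last.size(-1)
    scores = keys @ q_last / C ** 0.5                  # (K + M + 1,)
    prompt_scores, orig_scores = scores[:num_prompt], scores[num_prompt:]
    # Softmax is applied separately so the adaption prompts cannot disturb the
    # original attention distribution; the prompt part is scaled by the gate.
    attn = torch.cat([gate * torch.softmax(prompt_scores, dim=-1),
                      torch.softmax(orig_scores, dim=-1)], dim=-1)
    return attn @ values                               # weighted sum over all value vectors

q_last = torch.randn(64)                   # illustrative hidden dim C = 64
keys = torch.randn(10 + 21, 64)            # K = 10 adaption prompts + M + 1 = 21 original tokens
values = torch.randn(10 + 21, 64)
gate = torch.zeros(1, requires_grad=True)  # learnable gating factor, zero-initialized
h_last = zero_init_gated_attention(q_last, keys, values, gate, num_prompt=10)
```

With the gate initialized to 0, the adaption prompts receive zero attention weight and the output equals ordinary attention over the original tokens.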
The authors further conducted ablation experiments to justify the effectiveness of zero-init attention.
The final performance on ScienceQA
There is a performance gap of about 43 percentage points on ScienceQA between training with and without zero-init attention.
Setting | Val Acc (%) |
---|---|
Rand-Init Attention | 40.77 |
Zero-Init Attention | 83.85 |
Gain | +43.08 |
Robustness to overfitting
As the model is trained for more epochs on ScienceQA, it shows little sign of overfitting: validation accuracy keeps improving slightly even though the validation loss increases.
Epoch | Train Loss | Val Loss | Val Acc (%) |
---|---|---|---|
15 | 0.022 | 0.136 | 82.08 |
30 | 0.004 | 0.241 | 83.85 |
60 | 0.001 | 0.282 | 83.94 |
Convergence
The loss curves are also shown in the paper: with zero-init attention, training not only converges faster but also reaches a lower loss.
In addition to the analysis provided in the manuscript, we were curious how the gating factors grow throughout the training process. We therefore trained LLaMA-Adapter for 5 epochs on the Alpaca 52K instruction dataset and visualized the gating factor for each layer and each head. Note that we plot only the absolute value of the gating factor, since only its magnitude matters. We provide an interactive visualization below.
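For readers who want to reproduce a static version of such a plot, a rough sketch is given below; the `gates` array holds placeholder values, whereas in practice it would be collected from a trained LLaMA-Adapter checkpoint:

```python
import numpy as np
import matplotlib.pyplot as plt

gates = np.abs(np.random.randn(30, 32))    # placeholder |gating factors|, shape (layers, heads)
plt.imshow(gates, aspect='auto', cmap='viridis')
plt.xlabel('head')
plt.ylabel('layer')
plt.colorbar(label='|gating factor|')
plt.title('Gating factor magnitude per layer and head')
plt.show()
```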
As expected, the gating factors gradually grow throughout the training process. We also observe a trend that the gating factors in the upper layers tend to have higher values. This seems reasonable because the representations in the upper layers are more task-specific, so the adapters play a more crucial role there.
The results for multimodal QA are shown in the following table. LLaMA-Adapter outperforms previous methods on ScienceQA.
In this section, we perform a qualitative analysis by using various images and prompts to explore distinct forms of visual reasoning across the three methods. Within this analysis, we categorize the visual reasoning into eight distinct types, as outlined below:
Emotion / Atmosphere Inference:
In this category, we present the model with an image carefully chosen to evoke specific emotions or atmospheric qualities. The challenge for the model is to understand the underlying emotional tone or ambiance depicted in the image. This category tests the model’s ability to use visual cues such as lighting and colors, together with implicit reasoning, to analyze the emotional atmosphere captured in the image.
Create a Backstory:
In this category, we prompt the model to construct a compelling and detailed backstory for the characters, places, or objects depicted. This category tests the model’s creativity and imagination to contextualize visual information and create engaging stories using the visual cues provided in the image.
Predict Future:
In this category, we prompt the model to predict the future. It evaluates the model’s foresight and its ability to infer future scenarios, demonstrating its understanding of causal relationships and the dynamics of the depicted scene.
Explain Object-Object Relationship:
In this category, the model is prompted to reason about the connections, interactions, or dependencies between different objects. This category tests the model’s ability to infer the probable relationships between objects in the image.
Explain Human-Object Relationship:
In this category, the model is prompted to reason about the connections, interactions, or dependencies between humans and objects. This category tests the model’s ability to understand the gestures, expressions, and body language of humans and their interactions with surrounding objects in the image.
Explain Human-Human Relationship:
In this category, the model is expected to capture the nature of the humans’ relationship — whether it is friendly, adversarial, familial, romantic, or professional. This task assesses the model’s proficiency in understanding human emotions and social cues, enabling it to discern complex human relationships.
Confusing Image:
In this category, the model is presented with a confusing image that may defy common sense. This task evaluates the model’s capacity to identify the unusual part and interpret it.
Implicit Relationship Inference:
In this advanced category, the model is presented with subtle visual cues and needs to infer implicit relationships that are not immediately apparent. This evaluates the model’s capacity for in-depth reasoning and complex visual understanding.
Emotion/Atmosphere Inference
Prompt for picture: Describe the emotions and atmosphere depicted in the image. Explore the feelings that this setting might evoke in a person and elaborate on the ambiance of the room.
Confusing Image
Prompt for picture: What is the unusual of this picture?
Explain Human-Human Relationship
Prompt for picture: What is the relationship of these two people?
Predict Future
Prompt for picture: What might happen next?
Create a Backstory
Prompt for picture: Invent a detailed backstory for the abandoned old house.
Explain Human-Object Relationship
Prompt for picture: What is this picture about? What’s the feeling of the people and why they have such feelings?
Explain Object-Object Relationship
Prompt for picture: Explain the relationship between the magnifying glass and the antique map.
LLaVA-1.5 (7B) Response: The magnifying glass is placed over the antique map, which is a vintage, hand-drawn map of the world. The magnifying glass is used to examine the details of the map more closely, providing a better understanding of the geographical features and information displayed on the map. The magnifying glass enhances the visual experience by allowing the viewer to see the intricate details and nuances of the antique map, which might not be easily discernible without the aid of the magnifying glass. This combination of the magnifying glass and the antique map creates a visually engaging and educational experience for the viewer.
Implicit Relationship Inference
Prompt for picture: What is this picture about?
The following table summarizes the performance of the models on the different prompts (ideas/concepts). If a model correctly covers what the question asks, we count it as a success; if the model hallucinates, gets stuck, or gives an irrelevant answer, we mark it as a failure.
Model | Emotion / Atmosphere Inference | Create a Backstory | Predict Future | Explain Object-Object Relationship | Explain Human-Object Relationship | Explain Human-Human Relationship | Confusing Image | Implicit Relationship Inference |
---|---|---|---|---|---|---|---|---|
InstructBLIP | ❌ | ❌ (Lacks creativity) | ✔ | ✔ | ❌ | ✔ | ❌ | ❌ |
LLaVA-1.5 | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
LLAMA-Adapter | ✔ | ✔ | ✔ | ✔ | ❌ | ✔ | ❌ | ❌ |
To test the robustness of a multi-modal model, we provide a prompt that is completely unrelated to the image. This evaluates the model’s ability to focus on the provided textual input while ignoring the irrelevant visual context. There are two cases: with or without a hint that the image may be unrelated to the question.
Prompt for picture: Where is the man with the red hat?
LLaVA-1.5 (7B) Response: The man with the red hat is not visible in the image. The focus is on the open umbrella and the raindrops falling around it.
Prompt for picture: Where is the man with the red hat? Note that the image might be unrelated to this question.
LLaVA-1.5 (7B) Response: The man with the red hat is not visible in the image. The image features a close-up of a black umbrella in the rain, with water droplets falling on it.
Prompt for picture: Why there are so many people on the ocean?
LLaVA-1.5 (7B) Response: There are no people on the ocean in the image. The scene features a sandy beach with a tree and a dune, and the sky is orange, suggesting a sunset. The absence of people on the ocean could be due to various reasons, such as the time of day, weather conditions, or the location being a remote or less popular beach. It is also possible that the image was taken during a time when the beach is not crowded, such as early morning or late evening, when fewer people are visiting the beach.
Prompt for picture: Why there are so many people on the ocean? Note that the image might be unrelated to this question.
LLaVA-1.5 (7B) Response: The image shows a sandy desert landscape with a lone tree and a dune. There are no people visible in the image, so it is not possible to determine why there are so many people on the ocean. The image is focused on the desert environment and the lone tree, and the ocean is not a part of the scene
In this section, we explore how image and text embeddings align in different models and whether that alignment is beneficial to the model. To that end, we generated image and text embeddings for the VQA-v2 validation set, which contains MS COCO images with corresponding questions and answers. We sampled \(200\) image-text pairs from this dataset and generated the corresponding visual and textual embeddings using the pre-trained models (InstructBLIP, LLaVA-1.5, and LLaMA-Adapter). Each model produces image and text embeddings of dimension \(4096\).
To visualize the embeddings, we used PCA and t-SNE to reduce the dimensionality to \(2\) and \(3\), respectively. Before computing the principal components and t-SNE, we concatenated the image and text embeddings (each of shape \(\mathbf{R}^{D \times 4096}\), where \(D\) is the number of samples, \(200\) in our case) along dim-\(0\) to obtain a matrix of shape \(\mathbf{R}^{2D \times 4096}\). We then computed PCA and t-SNE on the concatenated matrix to obtain the reduced representations of dimension \(2\) and \(3\), respectively.
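A minimal sketch of this reduction pipeline (with random placeholder arrays standing in for the real embeddings) is shown below:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

img_emb = np.random.randn(200, 4096)                     # placeholder image embeddings
txt_emb = np.random.randn(200, 4096)                     # placeholder text embeddings
both = np.concatenate([img_emb, txt_emb], axis=0)        # (400, 4096)

pca_2d = PCA(n_components=2).fit_transform(both)         # (400, 2)
tsne_3d = TSNE(n_components=3, init='pca').fit_transform(both)  # (400, 3)
```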
The following figures show the interactive visualizations of PCA for the three models. The figures are in the following order (Instruct BLIP, LLaVA-1.5 and LLaMA Adapter).
The following figures show the interactive visualizations of t-SNE for the three models. The figures are in the same order (Instruct BLIP, LLaVA-1.5 and LLaMA Adapter).
As can be seen from the PCA and t-SNE plots above, the image and text embeddings of InstructBLIP are clustered together, reflecting the model’s text-conditioned training. In contrast, since LLaVA and LLaMA-Adapter do not use text conditioning when extracting the visual embeddings, their image and text embeddings are well separated.
In this section, we study whether the corresponding text embedding can be retrieved from an image embedding in the original mapped space of \(\mathbf{R}^{4096}\). We used the same setup as in the previous section, with the \(200\) image-text pairs from the validation split of the VQA-v2 dataset. We computed the cosine similarity of each image embedding with all text embeddings and extracted the top-\(k\) text embeddings with the highest similarity. We then computed the retrieval accuracy by checking whether the corresponding text embedding is present among the top \(k\). The results are shown in the following figure. InstructBLIP achieves perfect retrieval accuracy, which indicates that text-conditioned image-embedding extraction is beneficial to the model, whereas LLaVA and LLaMA-Adapter achieve a retrieval accuracy that is on par with random retrieval.
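This retrieval check can be sketched as follows (placeholder embeddings; the helper name is ours):

```python
import numpy as np

def topk_retrieval_accuracy(img_emb, txt_emb, k=5):
    # For each image embedding, check whether its own text embedding is among
    # the k most cosine-similar text embeddings.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T                                   # (N, N) cosine similarities
    topk = np.argsort(-sim, axis=1)[:, :k]              # indices of the k most similar texts
    hits = (topk == np.arange(len(img))[:, None]).any(axis=1)
    return hits.mean()

acc = topk_retrieval_accuracy(np.random.randn(200, 4096), np.random.randn(200, 4096), k=5)
```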
Because the three methods use diverse settings and datasets for training, direct comparisons among them are challenging. To address this, we introduce a unified framework that focuses on evaluating the impact of different architectural designs by varying two factors.
As shown in the framework overview, our framework consists of an image encoder, a Q-Former that bridges the extracted image features to the LLM, and an LLM. We use CLIP-ViT as the image encoder and Vicuna-7B as the LLM; Vicuna is a decoder-only Transformer instruction-tuned from LLaMA. Both the image encoder and the LLM are frozen during training. We use VQAv2 for training and validation.
There are two changing factors in our ablation study:
The results are shown below as top-1 accuracy on VQAv2. With conditional text, we observe 61.83% accuracy with the adapter, slightly higher than the 61.47% without it. Removing the conditional text lowers the accuracy to 58.59% with the adapter and to 60.55% without it.
VQAv2 top-1 accuracy (%) | W/ Adapter | W/O Adapter |
---|---|---|
W/ Conditional Text | 61.83 | 61.47 |
W/O Conditional Text | 58.59 | 60.55 |
LLaVA-1.5 (7B) | – | 78.5* |
From these observations, two conclusions emerge:
The first conclusion is that conditioning on text bolsters the extraction of image features, aligning them more closely with the text and thus enhancing question answering. Regarding the second, it is plausible that, since the adapter adjusts the image features across multiple layers, its efficacy is amplified when the features are conditioned on text, which ensures the quality of the extracted image embeddings.