Visual Text Pioneer

In this blog post, we introduce three ways of integrating visual information into LLMs, and we conduct a detailed analysis and comparison among them.

TL;DR

In this blog post, we introduce three existing types of approaches for integrating the vision modality into LLMs. We also conduct both qualitative and quantitative analyses comparing the three methods.

Introduction

Recent papers have shown that LLMs can be adapted to vision-language tasks with only a small amount of fine-tuning. Various methods have been proposed for bridging the visual and text modalities, yet comparisons and analyses among them are rarely studied. To this end, this blog post gives a gentle introduction to and summary of recent approaches for vision-language adaptation of LLMs. Although many related works have appeared in recent years, they basically fall into three categories: query-based, projection-based, and parameter-efficient tuning approaches (following the definitions in ). For each category, we select one representative work for this blog post.

In this blog post, we first give a detailed introduction to each representative work. We then present qualitative and quantitative analyses that we conducted ourselves and draw some conclusions from the results.

Method

| Categories | Query-based | Projection-based | Parameter-Efficient Tuning |
| --- | --- | --- | --- |
| Selected model | InstructBLIP | LLaVA | LLaMA-Adapter |
| Is the extracted image features conditioned on text? | Yes | No | No |
| Where are the two modalities concatenated? | At the input of the LLM | At the input of the LLM | At adapters in multiple layers of the LLM |
| Is the model pretrained on an image-caption dataset? | Yes | No | No |
| Trainable Params | 188M | 7B | 1.2M |
| ScienceQA Results (Acc %) | 79.5 | 90.9 | 85.2 |

Comparison of the three selected models. (ScienceQA results are reported from each individual paper.)

The two rows "Is the extracted image features conditioned on text?" and "Where are the two modalities concatenated?" capture the main concepts that distinguish the three categories.

In the following subsection, we will introduce the details about each selected model.

Query-based

In this section, we choose BLIP-2 and its extension InstructBLIP as the representative papers. In BLIP-2, the authors propose a lightweight Querying Transformer (Q-Former) to bridge the gap between the image and text modalities. A set of learnable queries learns to extract text-related features from the image during the pre-training stage. In InstructBLIP, the authors formulate an instruction-tuning dataset and propose instruction-aware visual feature extraction, which extends the Q-Former of BLIP-2.

Q-Former

Q-Former is a trainable module that bridges the gap between a frozen image encoder and a frozen LLM. It contains two transformer submodules that share the same self-attention layers: (1) an image transformer and (2) a text transformer. A set of learnable queries is fed as input to the image transformer and interacts with the frozen image features through cross-attention layers. Q-Former is pre-trained in two stages, as follows:

Bootstrap Vision-Language Representation Learning from a Frozen Image Encoder

Extract visual representation that is most informative of the text.

In the representation learning stage, Q-Former is connected to a frozen image encoder and pre-trained on image-text pairs. The goal is to enable the queries to extract the visual representation that is most relevant to the text. There are three learning objectives, and each one employs a different attention masking strategy between queries and text.

Figure from . (Left) Model architecture of Q-Former and BLIP-2’s first-stage vision-language representation learning objectives. (Right) The self-attention masking strategy for each objective to control query-text interaction.
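To make the architecture concrete, below is a minimal PyTorch sketch of a single query/cross-attention block. The dimensions (32 queries of size 768), module layout, and names are our own illustrative assumptions, not the official BLIP-2 implementation.

```python
import torch
import torch.nn as nn

class TinyQFormerBlock(nn.Module):
    """Illustrative block: learnable queries extract information from frozen image features."""
    def __init__(self, dim=768, num_queries=32, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)  # learnable query tokens
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_feats):
        # image_feats: patch features from the frozen image encoder, shape (B, N_patches, dim)
        q = self.queries.expand(image_feats.size(0), -1, -1)
        q = q + self.self_attn(q, q, q)[0]                        # queries interact with each other
        q = q + self.cross_attn(q, image_feats, image_feats)[0]   # queries attend to image features
        return q + self.ffn(q)                                    # (B, num_queries, dim) query outputs
```

In the actual Q-Former, the text transformer shares these self-attention layers with the queries, which is also what later allows InstructBLIP to feed instruction tokens alongside the queries.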

Bootstrap Vision-to-Language Generative Learning from a Frozen LLM

In the generative pre-training stage, Q-Former (with the frozen image encoder attached) is connected to a frozen LLM. The output query embeddings are projected to the same dimension as the LLM's text embeddings. Then, the projected query embeddings are prepended to the input text embeddings as a 'soft prompt'. Two types of LLMs are considered: decoder-based and encoder-decoder-based, as the figure below shows.

Figure from . BLIP-2’s second-stage vision-to-language generative pre-training. (Top) Bootstrapping a decoder-based LLM (e.g. OPT). (Bottom) Bootstrapping an encoder-decoder-based LLM (e.g. FlanT5).
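The projection-and-prepend step can be summarized in a short sketch; the dimensions and helper function below are assumptions for illustration, not the official code.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 32 query outputs of size 768 from the Q-Former,
# and an LLM whose token embeddings have size 4096.
proj = nn.Linear(768, 4096)

def build_llm_inputs(query_output, text_embeds):
    # query_output: (B, 32, 768) output query embeddings from the Q-Former
    # text_embeds:  (B, T, 4096) token embeddings from the frozen LLM
    soft_prompt = proj(query_output)                     # project to the LLM's embedding size
    return torch.cat([soft_prompt, text_embeds], dim=1)  # prepend as a 'soft prompt'
```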

InstructBLIP

In this paper, the authors conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models.

  1. Instruction-tuning. They use 13 held-in datasets for instruction tuning and 13 held-out datasets for zero-shot evaluation.
  2. Instruction-aware Visual Feature Extraction. They propose an instruction-aware Q-Former module, which extends the Q-Former in BLIP-2 to take the instruction text tokens as additional input. The instruction interacts with the query embeddings through the self-attention layers of the Q-Former and encourages the extraction of task-relevant image features.
  3. Balanced Data Sampling. Because the datasets differ significantly in size, mixing them uniformly could cause the model to overfit smaller datasets and underfit larger ones. The authors therefore propose to sample datasets with probabilities proportional to the square root of the number of training samples, as the sketch below illustrates.
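A minimal sketch of these sampling probabilities; the dataset names and sizes are made up for illustration.

```python
import math

# Hypothetical dataset sizes (number of training samples per held-in dataset).
sizes = {"vqa": 440_000, "captioning": 80_000, "ocr": 20_000}

weights = {name: math.sqrt(n) for name, n in sizes.items()}
total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}
# Larger datasets are still sampled more often, but small datasets are up-weighted
# compared with mixing all examples uniformly.
print(probs)
```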

Experiment

Table from .
Table from .
Table from .

Projection-based

In this section, we chose LLaVA and LLaVA-1.5 as the representative papers. In LLaVA, the authors propose an end-to-end multimodal model that bridges the gap between an off-the-shelf vision encoder and a large language model (LLM) by mapping the representations of the vision model into the textual embedding space. The authors also curate a multimodal instruction-following dataset using the language-only GPT-4 model. In LLaVA-1.5, the authors argue that because LLaVA is not pre-trained on large-scale data, as query-based methods are, its performance is undermined; moreover, LLaVA struggles to balance short- and long-form VQA tasks. To address these issues, the authors use prompt engineering to specify the response format and also increase the capacity of the projection layers.

Contributions of LLaVA

GPT-4 Assisted Visual Instruction Data Generation

The authors use the language-only GPT-4 model to generate visual instruction-tuning data for (image, text) pairs by prompting GPT-4 with symbolic descriptions of the image, such as captions of the scene and objects and the bounding boxes of those objects. The model is also seeded with a few manually curated examples. Three types of instruction-following data are collected: conversations, detailed descriptions, and complex reasoning.

LLaVA Model Architecture

Model image taken from

The LLaVA model relies on off-the-shelf pretrained vision and language models, each of which maps its input to a separate high-dimensional space. To effectively leverage their capabilities (i.e., to jointly use the information captured by the vision and language embeddings), the visual embeddings have to be mapped into the language embedding space. For an input image \(X_v\), a pre-trained CLIP visual encoder (ViT-L/14) is used to extract the visual features \(Z_v = g(X_v)\); the model uses grid features before and after the last Transformer layer in its experiments. To condition the text generation on the image, the extracted visual features are mapped to the space of text token embeddings via a single linear layer. Specifically, \(Z_v\) is converted to \(H_v\) via a learnable matrix \(W\), i.e., \(H_v = W \cdot Z_v\).
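A minimal sketch of this projection, assuming CLIP ViT-L/14 features of size 1024 and an LLM embedding size of 4096; the variable names mirror the notation above and are not the official code.

```python
import torch.nn as nn

# The trainable projection matrix W maps visual features into the LLM token-embedding space.
W = nn.Linear(1024, 4096, bias=False)

def project_image_features(Z_v):
    # Z_v: (B, N_patches, 1024) visual features from the frozen CLIP encoder
    return W(Z_v)              # H_v: (B, N_patches, 4096) visual tokens for the LLM
```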

Training of LLaVA Model

The data used to instruction-tune the model is organized as follows. For each image \(X_v\), multi-turn conversation data \((X_q^1, X_a^1, \dots, X_q^T, X_a^T)\) is generated, where \(T\) is the total number of turns. At turn \(t\), the model is given \(X_{instruct}^t\), defined as:

\(X_{instruct}^t = \begin{cases} \text{Random choice } [X_v, X_q^1] \text{ or } [X_q^1, X_v] & t=1 \\ X_q^t & t > 1 \\ \end{cases}\)

Instruction tuning of the LLM is performed on the prediction tokens with the original auto-regressive training objective; that is, the model is fine-tuned with the loss computed only on the predicted answer tokens (a minimal sketch of this masking follows the two stages below). Training is done in two stages:

  1. Pre-training for feature alignment: only the projection matrix \(W\) is updated, using image-caption pairs converted into a simple instruction-following format.
  2. Fine-tuning end-to-end: both the projection matrix and the LLM are updated on the generated multimodal instruction-following data, while the visual encoder stays frozen.
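The answer-only supervision mentioned above is commonly implemented by masking the labels of every non-answer token; the helper below is our own illustration of this recipe, not LLaVA's exact code.

```python
import torch

IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips positions with this label

def build_labels(input_ids: torch.Tensor, answer_mask: torch.Tensor) -> torch.Tensor:
    """Keep the auto-regressive loss only on answer tokens.

    input_ids:   (B, T) token ids of the full multimodal conversation
    answer_mask: (B, T) True where a token belongs to an assistant answer
    """
    labels = input_ids.clone()
    labels[~answer_mask] = IGNORE_INDEX  # image tokens, system prompt, and questions carry no loss
    return labels
```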

Experiments

Table from

The experimental analysis showed that LLaVA achieves state-of-the-art mean accuracy on ScienceQA compared with other methods such as chain-of-thought (CoT) prompting and LLaMA-Adapter.

LLaVA-1.5

Contributions of LLaVA-1.5

The base architecture of LLaVA is kept intact, but several modifications are made, as described below.

The authors observe that LLaVA falls short on academic benchmarks that typically require short-form answers, and they attribute this to the fact that LLaVA is not pretrained on large-scale data as other approaches are. The table below studies the scaling effect of data, model size, and image resolution on a selection of three datasets.

Table from

To control the length of LLaVA's answers, the authors explicitly state the desired response format in the prompt during fine-tuning, which helps the model learn to control the length of its output. In addition, the model is fine-tuned on academic-task-oriented VQA datasets such as open-knowledge VQA, OCR VQA, and region-level VQA. A two-layer MLP is used instead of a single linear layer to project visual features into the text token embedding space, the input image resolution is scaled up to $336$ px, and the LLM is scaled up to $13$B parameters.
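A minimal sketch of such a projector, assuming CLIP ViT-L/14-336px features of size 1024 and a 13B LLM hidden size of 5120; this is our illustration, not the official implementation.

```python
import torch.nn as nn

# Two-layer MLP with GELU replacing LLaVA's original single linear projection.
mlp_projector = nn.Sequential(
    nn.Linear(1024, 5120),
    nn.GELU(),
    nn.Linear(5120, 5120),
)
```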

Experiments

Table from

With these additions, LLaVA-1.5 achieves state-of-the-art performance on $11$ out of $12$ academic VQA benchmarks specifically proposed for instruction-following LMMs.

Limitations

Parameter-Efficient Tuning

In this section, we choose LLaMA-Adapter as the representative work. To incorporate visual information into a pretrained LLM, we can also use adapters. Adapters are a common technique for finetuning large models on downstream tasks: rather than tuning the entire model, we inject lightweight learnable parameters into different layers of the large model. By doing so, we can steer the pretrained model toward a new downstream task, changing the representations at different depths without updating the original layers. In LLaMA-Adapter, visual information is integrated into the LLaMA model by adding multi-scale CLIP image encoder outputs to the learnable adaption prompts.
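Conceptually, this fusion can be sketched as adding projected image features to the learnable adaption prompts; the prompt length and feature sizes below are assumptions for illustration, not the official implementation.

```python
import torch
import torch.nn as nn

K, C = 10, 4096                              # number of adaption prompts, LLaMA hidden size
adaption_prompt = nn.Parameter(torch.zeros(K, C))  # learnable adaption prompts for one layer
visual_proj = nn.Linear(1024, C)             # projects (multi-scale) CLIP image features

def visual_adaption_prompt(clip_feats):
    # clip_feats: (B, 1024) pooled image representation from the frozen CLIP encoder
    v = visual_proj(clip_feats).unsqueeze(1)      # (B, 1, C)
    return adaption_prompt.unsqueeze(0) + v       # (B, K, C) visual-aware adaption prompts
```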

In LLaMA-Adapter, the authors also propose a new training scheme, zero-init gated attention. Finetuning with adapters is often unstable in the early stage of training, because the pretrained model has not yet learned how to utilize the newly injected adapter modules. To this end, the authors introduce a gating factor \(\mathbf{g}_l\) that scales the adapter part of the attention scores before they are multiplied with the values in the attention layer. Their ablation studies further substantiate the advantage of this design.

The main contributions of LLaMA-Adapter are its lightweight adaption prompts, which add only 1.2M trainable parameters on top of the frozen LLaMA model, and the zero-init gated attention that stabilizes early training and enables multimodal inputs.

The overview of LLaMA-Adapter Architecture. (Figure from )

Zero-Init Attention

To mitigate the early-stage disturbance from the adaption prompts, the authors introduce a learnable gating factor on the attention scores at the adaption prompts' positions. Specifically, let's take a closer look at the attention layer of LLaMA.

Suppose we have \(K\) adaption prompts prepended to the original sequence (of length \(M+1\)), and let \(C\) denote the hidden dimension of the model. Now consider the attention calculation for the last timestep: \(\mathbf{Q}_l\) is the query vector of the last timestep, \(\mathbf{K}_l\) are the key vectors of the entire input sequence (length \(K+M+1\)), and \(\mathbf{V}_l\) are the value vectors of the entire input sequence (length \(K+M+1\)).

To calculate the attention scores \(\mathbf{S}_l\) of the last-timestep query against all keys, we simply take the dot product of \(\mathbf{Q}_l\) and \(\mathbf{K}_l\) and normalize it by \(\sqrt{C}\), i.e., \(\mathbf{S}_l = \mathbf{Q}_l \mathbf{K}_l^T / \sqrt{C}\).

Notice that the upper part (the first \(K\) entries) of the attention scores \(\mathbf{S}_l\) is affected by the adaption prompts (\([\mathbf{P}_1, \dots, \mathbf{P}_K]\)), while the rest are not. To let the model gradually learn to utilize the adaption prompts, the authors multiply the upper part by a learnable gating factor \(\mathbf{g}_l\). The softmax is applied separately to the upper and lower parts of \(\mathbf{S}_l\), rather than to the whole vector, so that the lower part (the original attention) is not affected by the values in the upper part (the adaption prompts).

In their implementation code, the gating factor is different for each layer and each attention head.

Finally, the modified attention scores are used to compute a weighted sum over the value vectors, giving the final hidden state of the last timestep. In summary, when the gating factor is 0, the computation reduces to ordinary attention. The authors initialize the gating factor to 0, which is why the method is dubbed zero-init attention.
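Putting the pieces together, here is a single-head sketch of the gated attention for the last timestep. It is our own re-implementation of the idea described above; the released code additionally keeps a separate gate per layer and per attention head.

```python
import torch
import torch.nn.functional as F

def zero_init_gated_attention(Q_l, K_l, V_l, g_l, K):
    """Single-head sketch of the gated attention for the last timestep.

    Q_l: (1, C)       query of the last timestep
    K_l: (K+M+1, C)   keys of the adaption prompts + original sequence
    V_l: (K+M+1, C)   values of the adaption prompts + original sequence
    g_l: scalar tensor, learnable gating factor initialized to 0
    K:   number of adaption prompts
    """
    C = Q_l.shape[-1]
    S_l = Q_l @ K_l.T / C ** 0.5                         # (1, K+M+1) attention scores
    S_prompt, S_orig = S_l[:, :K], S_l[:, K:]
    # Softmax is applied separately so the original attention is untouched when g_l = 0.
    S_g = torch.cat([F.softmax(S_prompt, dim=-1) * g_l,
                     F.softmax(S_orig, dim=-1)], dim=-1)
    return S_g @ V_l                                     # (1, C) hidden state of the last timestep
```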

The authors further conducted ablation experiments to justify the effectiveness of zero-init attention.

  1. The final performance on ScienceQA

    There is a performance gap of about 43% on ScienceQA between w/ and w/o zero-init attention.

    | Setting | Val Acc (%) |
    | --- | --- |
    | Rand-Init Attention | 40.77 |
    | Zero-Init Attention | 83.85 |
    | Gain | +43.08 |

    Table from
  2. Robustness to overfitting

    As the model is trained for more epochs on ScienceQA, it hardly shows overfitting.

    | Epoch | Train Loss | Val Loss | Val Acc (%) |
    | --- | --- | --- | --- |
    | 15 | 0.022 | 0.136 | 82.08 |
    | 30 | 0.004 | 0.241 | 83.85 |
    | 60 | 0.001 | 0.282 | 83.94 |

    Table from
  3. Convergence

    The loss curves are also shown in the paper. With zero-init attention, the loss not only converges faster but also reaches a lower value.

Training curves w/ and w/o zero-init attention. (Figure from )

In addition to the analysis provided in the manuscript, we were curious how the gating factor grows throughout the training process. Hence, we trained the LLaMA-Adapter for 5 epochs on the Alpaca 52K instruction dataset and visualized the gating factor for each layer and each head. Note that we plot only the absolute value of the gating factor, since only its magnitude matters. We provide an interactive visualization below.

As expected, the gating factors gradually grow throughout the training process. We also observe a trend that the gating factors in the upper layers tend to have higher values. This seems reasonable because the representations in the upper layers are more task-specific, so the role of the adapters is more crucial there.

The results for multimodal QA are shown in the following table. LLaMA-Adapter outperforms previous LLMs on ScienceQA.

Question Answering Accuracy (%) on ScienceQA’s test set. "T" denotes the single-modal model with text-only input. (Table from )

Qualitative Analysis-Different Types of Images and Prompts

In this section, we perform a qualitative analysis using various images and prompts to explore distinct forms of visual reasoning across the three methods. We categorize the visual reasoning into eight distinct types, as outlined below:

  1. Emotion / Atmosphere Inference:

    In this category, we present the model with an image carefully chosen to evoke specific emotions or atmospheric qualities. The challenge for the model is to understand the underlying emotional tone or ambiance depicted in the image. This category tests the model’s ability to utilize visual cues such as lighting, colors, or even implicit reasoning to analyze the emotional atmosphere captured in the image.

  2. Create a Backstory:

    In this category, we prompt the model to construct a compelling and detailed backstory for the characters, places, or objects depicted. This category tests the model’s creativity and imagination to contextualize visual information and create engaging stories using the visual cues provided in the image.

  3. Predict Future:

    In this category, we prompt the model to predict the future. It evaluates the model’s foresight and its ability to infer future scenarios, demonstrating its understanding of causal relationships and the dynamics of the depicted scene.

  4. Explain Object-Object Relationship:

    In this category, the model is prompted to reason the connections, interactions, or dependencies between different objects. This category tests the model’s ability to reason the probable relationship between objects in the image.

  5. Explain Human-Object Relationship:

    In this category, the model is prompted to reason the connections, interactions, or dependencies between humans and objects. This category tests the model’s ability to understand gestures, expressions, and body language of human and their interactions with surrounding objects in the image.

  6. Explain Human-Human Relationship:

    In this category, the model is expected to capture the nature of human’s relationship — whether it is friendly, adversarial, familial, romantic, or professional. This task assesses the model’s proficiency in understanding human emotions and social cues, enabling it to discern complex human relationships.

  7. Confusing Image:

    In this category, the model is presented with a confusing image that is unusual relative to common sense. This task evaluates the model’s capacity to capture the uncommon part and interpret it.

  8. Implicit Relationship Inference:

    In this advanced category, the model is presented with subtle visual cues and needs to infer implicit relationships that are not immediately apparent. This evaluates the model’s capacity for in-depth reasoning and complex visual understanding.

Images for qualitative analysis and model responses

Emotion/Atmosphere Inference
Prompt for picture: Describe the emotions and atmosphere depicted in the image. Explore the feelings that this setting might evoke in a person and elaborate on the ambiance of the room.

Confusing Image
Prompt for picture: What is the unusual of this picture?

Explain Human-Human Relationship
Prompt for picture: What is the relationship of these two people?

Predict Future
Prompt for picture: What might happen next?

Create a Backstory
Prompt for picture: Invent a detailed backstory for the abandoned old house.

Explain Human-Object Relationship
Prompt for picture: What is this picture about? What’s the feeling of the people and why they have such feelings?

Explain Object-Object Relationship
Prompt for picture: Explain the relationship between the magnifying glass and the antique map.

Implicit Relationship Inference
Prompt for picture: What is this picture about?

Conclusion

The following table summarizes the performance of the models on the different prompts (ideas/concepts). If the model correctly covers what the question asks, we evaluate it as a success; if the model hallucinates, gets stuck, or gives an irrelevant answer, we mark it as a failure.

| Model | Emotion / Atmosphere Inference | Create a Backstory | Predict Future | Explain Object-Object Relationship | Explain Human-Object Relationship | Explain Human-Human Relationship | Confusing Image | Implicit Relationship Inference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InstructBLIP | ❌ (Lacks creativity) | | | | | | | |
| LLaVA-1.5 | | | | | | | | |
| LLaMA-Adapter | | | | | | | | |

Qualitative Analysis-Robustness

To test the robustness of a multimodal model, we provide a prompt that is completely unrelated to the image. This evaluates the model’s ability to focus on the provided textual input while ignoring the irrelevant visual context. There are two cases: with or without a hint that the image may be unrelated to the question.

Prompt for picture: Where is the man with the red hat?

Prompt for picture: Where is the man with the red hat? Note that the image might be unrelated to this question.

Prompt for picture: Why there are so many people on the ocean?

Prompt for picture: Why there are so many people on the ocean? Note that the image might be unrelated to this question.

Qualitative Analysis-Embedding Visualizations

In this section, we wanted to explore how image and text embeddings align in different models and whether that alignment is beneficial for the model. To that end, we generated image and text embeddings for the VQA-v2 validation dataset, which contains MS COCO images with corresponding questions and answers. We sampled \(200\) image-text pairs from this dataset and generated the corresponding visual and textual embeddings using the pre-trained models (InstructBLIP, LLaVA-1.5, and LLaMA-Adapter). Each model produces image and text embeddings of dimension \(4096\).

To visualize the embeddings, we used PCA and t-SNE to reduce the dimensionality to \(2\) and \(3\), respectively. Before computing the principal components and t-SNE, we concatenated the image and text embeddings (each of dimension \(\mathbf{R}^{D \times 4096}\), where \(D\) is the number of samples, \(200\) in our case) along dim 0 to obtain a matrix of dimension \(\mathbf{R}^{2D \times 4096}\). We then computed PCA and t-SNE on this concatenated matrix.
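A minimal sketch of this procedure with scikit-learn; the embeddings below are random placeholders standing in for the ones we extracted from the models.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# img_emb, txt_emb: (200, 4096) embeddings extracted from one of the three models
# (random placeholders here for illustration).
img_emb = np.random.randn(200, 4096)
txt_emb = np.random.randn(200, 4096)

X = np.concatenate([img_emb, txt_emb], axis=0)   # (2D, 4096) with D = 200
pca_2d = PCA(n_components=2).fit_transform(X)    # (400, 2)
tsne_3d = TSNE(n_components=3, init="pca", perplexity=30).fit_transform(X)  # (400, 3)
# Rows 0..199 are image embeddings, rows 200..399 are the corresponding text embeddings.
```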

PCA

The following figures show the interactive PCA visualizations for the three models, in the order InstructBLIP, LLaVA-1.5, and LLaMA-Adapter.

t-SNE

The following figures show the interactive t-SNE visualizations for the three models, in the same order (InstructBLIP, LLaVA-1.5, and LLaMA-Adapter).

As can be seen from the PCA and t-SNE plots above, for InstructBLIP the image and text embeddings are clustered together, reflecting the model’s text-conditioned feature extraction. In contrast, since LLaVA and LLaMA-Adapter do not use text-conditioning when extracting visual embeddings, their image and text embeddings are well separated.

Quantitative Analysis-Image-to-Text Retrieval

In this section, we conducted a study to analyze whether we can retrieve the corresponding text embedding from an image embedding in the shared mapped space of \(\mathbf{R}^{4096}\). We used the same setup as in the previous section, with the \(200\) image-text pairs from the validation split of the VQA-v2 dataset. We computed the cosine similarity of each image embedding with all text embeddings and extracted the top \(k\) text embeddings with the highest similarity. We then computed accuracy by checking whether the corresponding text embedding is present among those top \(k\). The results are shown in the following figure. From the figure, it can be seen that InstructBLIP has perfect retrieval accuracy, which indicates that text-conditioned image embedding extraction is beneficial for the model. However, LLaVA and LLaMA-Adapter have retrieval accuracies that are on par with random retrieval.
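The retrieval metric can be computed with a few lines of NumPy; the function below is a sketch of our procedure with illustrative names.

```python
import numpy as np

def topk_retrieval_accuracy(img_emb, txt_emb, k=5):
    """Fraction of images whose paired text is among the k most cosine-similar text embeddings.

    img_emb, txt_emb: (N, 4096) arrays where row i of each forms a matching pair.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T                                  # (N, N) cosine similarities
    topk = np.argsort(-sim, axis=1)[:, :k]             # indices of the k closest texts per image
    hits = (topk == np.arange(len(img_emb))[:, None]).any(axis=1)
    return hits.mean()
```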

Quantitative Analysis-Unified Framework

Because the three methods use different settings and datasets for training, direct comparison among them is challenging. To address this, we introduce a unified framework that evaluates the impact of the different architectural designs by varying two factors.

The model architecture of our unified framework.

As shown in the framework overview, our framework consists of an image encoder, a Q-Former that bridges the extracted image features to the LLM, and the LLM itself. We use CLIP-ViT as the image encoder and Vicuna-7B as the LLM; Vicuna is a decoder-only Transformer instruction-tuned from LLaMA. Both the image encoder and the LLM are frozen during training. We use VQAv2 for training and validation.

There are two changing factors in our ablation study:

  1. Text-conditioning on the Q-Former.
  2. Integrating an adapter into multiple LLM layers or merely concatenating image embeddings with text embeddings in the LLM input.

The results are shown below, reported as top-1 accuracy on the VQAv2 dataset. With text-conditioning, using the adapter yields 61.83% accuracy, slightly higher than the 61.47% obtained by concatenating the image embeddings at the LLM input. Without text-conditioning, accuracy drops to 58.59% with the adapter and 60.55% without it.

VQAv2 top-1 accuracy (%):

| | W/ Adapter | W/O Adapter |
| --- | --- | --- |
| W/ Conditional Text | 61.83 | 61.47 |
| W/O Conditional Text | 58.59 | 60.55 |
| LLaVA-1.5 (7B) | 78.5* | |

From these observations, two conclusions emerge:

  1. Text-conditioning marginally improves performance.
  2. The impact of the adapter is contingent on text-conditioning.

The first conclusion suggests that conditioning on text helps the Q-Former extract image features that are more closely aligned with the question, which improves question answering. Regarding the second, it is plausible that because the adapter adjusts image features across multiple layers, its benefit depends on the quality of the extracted image embeddings, which text-conditioning helps to ensure.