Can you transfer prompts? Where is the best place to append them? Do they increase adversarial robustness? Find out here :)
This project, completed for the Introduction to Deep Learning course, focused on Visual Prompt Tuning (VPT) in Vision Transformers (ViT).
Vision Transformer (ViT): A neural network architecture that applies the transformer model, originally designed for natural language processing, to image analysis tasks.
Visual Prompt Tuning (VPT): A parameter-efficient fine-tuning technique that introduces learnable continuous vectors (prompts) in the embedding or pixel space. The transformer backbone stays frozen, and only the task-specific prompts are updated during training.
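The core mechanism can be sketched in a few lines; this is a minimal, framework-free illustration (all dimensions and names are illustrative, not taken from the project):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ViT-Base-like dimensions, for illustration only.
num_patches, embed_dim, num_prompts = 196, 768, 50

# Patch embeddings produced by the frozen backbone (never updated).
patch_embeddings = rng.standard_normal((num_patches, embed_dim))

# Task-specific prompt tokens: the ONLY parameters updated during training.
prompts = rng.standard_normal((num_prompts, embed_dim)) * 0.02

# Prepend the prompts to the patch sequence before the transformer encoder.
encoder_input = np.concatenate([prompts, patch_embeddings], axis=0)
print(encoder_input.shape)  # (246, 768)
```

The key point is the parameter budget: only `num_prompts * embed_dim` values are trained, a tiny fraction of the backbone's weights.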
Our project explored various aspects of VPT through several experiments and ablation studies:
We compared three different approaches for prompt placement. Results showed that prepending prompts to the embedding layer yielded the best performance.
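To make the placement distinction concrete, here is an illustrative sketch of two common options from the VPT literature (the exact three variants compared in the project are not listed here, and all names and shapes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
num_patches, embed_dim, num_prompts = 196, 768, 50

patches = rng.standard_normal((num_patches, embed_dim))
prompts = rng.standard_normal((num_prompts, embed_dim)) * 0.02

# Option A: prepend prompts as extra tokens (the best-performing choice here).
# The sequence grows, so every patch token can attend to every prompt.
prepended = np.concatenate([prompts, patches], axis=0)  # shape (246, 768)

# Option B (illustrative alternative): add a prompt vector element-wise to
# every patch token, which keeps the sequence length unchanged.
added = patches + prompts[0]  # shape (196, 768)

print(prepended.shape, added.shape)
```

Prepending changes the attention pattern rather than the token contents, which is one plausible reason it performed best.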
We conducted a sweep from 25 to 150 tokens to determine the optimal prompt length.
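A sweep like this boils down to one train-and-evaluate run per prompt length. The sketch below shows the loop shape; `evaluate_with_prompt_length` is a hypothetical stand-in (mocked here so the loop runs), and the accuracy curve it returns is invented, not a project result:

```python
def evaluate_with_prompt_length(n_prompts: int) -> float:
    # Mock validation accuracy: improves with more prompts, then degrades
    # past a saturation point (purely illustrative numbers).
    return 0.80 + 0.10 * min(n_prompts, 75) / 75 - 0.0002 * max(0, n_prompts - 75)

# Sweep the prompt length in steps of 25 tokens, as in the 25-150 range above.
sweep = range(25, 151, 25)
results = {n: evaluate_with_prompt_length(n) for n in sweep}
best = max(results, key=results.get)
print(best)  # prints 75 for this mock curve
```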
We investigated how performance changes with the number of transformer encoder layers to which learnable prompt parameters were prepended.
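In this "deep" prompting setup, each prompted layer gets its own learnable prompt set, inserted before the layer and replaced before the next one. A minimal sketch, with a frozen encoder block mocked as the identity and all sizes assumed:

```python
import numpy as np

rng = np.random.default_rng(2)
num_patches, embed_dim, num_prompts = 196, 768, 50
num_prompted_layers = 4  # hypothetical ablation setting

def frozen_encoder_layer(x):
    # Stand-in for a frozen transformer block (identity for illustration).
    return x

# One independent learnable prompt set per prompted layer.
layer_prompts = [rng.standard_normal((num_prompts, embed_dim)) * 0.02
                 for _ in range(num_prompted_layers)]

x = rng.standard_normal((num_patches, embed_dim))
for prompts in layer_prompts:
    # Prepend this layer's prompts, run the layer, then strip the prompt
    # positions so the next layer can insert its own fresh prompts.
    x = frozen_encoder_layer(np.concatenate([prompts, x], axis=0))[num_prompts:]

print(x.shape)  # (196, 768)
```

The ablation then varies `num_prompted_layers` and measures accuracy.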
We tested the model’s resilience to input noise, finding that adding a prompt consistently improved robustness to noisy inputs.
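One simple way to probe this is to perturb the inputs with Gaussian noise at increasing strengths and re-evaluate. A sketch of the perturbation step (batch shape and noise levels are illustrative assumptions; the actual evaluation protocol may differ):

```python
import numpy as np

rng = np.random.default_rng(3)

# Clean input batch with values in [0, 1]; shapes are illustrative (NCHW).
images = rng.random((8, 3, 224, 224))

# Gaussian input noise at increasing strengths.
for sigma in (0.0, 0.1, 0.2):
    noisy = np.clip(images + sigma * rng.standard_normal(images.shape), 0.0, 1.0)
    # A real run would compare accuracy of the prompted vs. un-prompted model
    # on `noisy`; here we only confirm the perturbed batch stays well-formed.
    assert noisy.shape == images.shape
    assert 0.0 <= noisy.min() and noisy.max() <= 1.0
```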
We explored whether prompts trained on one dataset (CUB-200) could provide a better initialization than standard methods when applied to a new dataset (Stanford Dogs).
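Transfer here just means swapping the initialization of the prompt tensor before fine-tuning on the target task. A minimal sketch of the two initialization choices (shapes and the random-init scale are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
num_prompts, embed_dim = 50, 768

# Hypothetical prompts already trained on the source dataset (CUB-200).
source_prompts = rng.standard_normal((num_prompts, embed_dim)) * 0.02

# Baseline: standard random initialization for the target task (Stanford Dogs).
random_init = rng.standard_normal((num_prompts, embed_dim)) * 0.02

# Transfer: reuse the source-trained prompts as the starting point instead.
transfer_init = source_prompts.copy()

# Both serve as the initial prompt parameters; training then proceeds as usual.
print(transfer_init.shape == random_init.shape)  # True
```

The comparison is then simply which initialization converges faster or to higher accuracy on the target dataset.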