Let ViT Speak: Generative Language-Image Pre-training | Yan Fang et al. | ResearchPod