Visual design tools and vision-language models have widespread applications in the multimedia industry. Despite significant advancements in recent years, a solid understanding of these tools...
Enabling spatial understanding in vision-language learning models remains a core research challenge. This understanding underpins two crucial capabilities: grounding and referring. Referring enables the model to...