[1] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification. In Advances in Neural Information Processing Systems, 2021.
[2] Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not All Patches are What You Need: Expediting Vision Transformers Via Token Reorganizations. In International Conference on Learning Representations, 2022.
[3] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token Merging: Your ViT but Faster. In International Conference on Learning Representations, 2023.
[4] Yi Chen, Jian Xu, Xu-Yao Zhang, Wen-Zhuo Liu, Yang-Yang Liu, and Cheng-Lin Liu. Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.
[5] Quang-Hung Le, Long Hoang Dang, Ngan Le, Truyen Tran, and Thao Minh Le. Progressive Multi-Granular Alignments for Grounded Reasoning in Large Vision-Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.
[6] Seungdong Yoa, Seungjun Lee, Hye-Seung Cho, Bumsoo Kim, and Woohyung Lim. ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.