[1]
P. Y. Wu and W. R. Mebane, “MARMOT: A Deep Learning Framework for Constructing Multimodal Representations for Vision-and-Language Tasks,” CCR, vol. 4, no. 1, pp. 275–322, May 2022.