While great progress has been made in both computer vision and natural language processing in the past decade, bridging vision and language remains a fundamental and challenging problem for advancing artificial intelligence. Vision is our most important means of perceiving the world, while language carries high-level semantic information and abstract knowledge for communication and reasoning. Bridging these complementary modalities not only benefits representation learning in each modality, but also empowers future AI systems to unify perception, communication, reasoning, and creation abilities.
In this talk, I will introduce my research on vision-language cross-modal understanding and generation. Understanding cross-modal information enables AI systems to process and learn from more complex information and to interact better with humans. I will first present my work on vision-language cross-modal understanding, including discriminative image captioning, visual grounding, text-image retrieval, and learning visual representations from language supervision. Beyond perception and understanding, creation and imagination are more advanced forms of intelligence. In particular, synthesizing images from text instructions allows fine-grained and user-friendly control over visual content creation and editing. In the second part of my talk, I will present my contributions to vision-language cross-modal generation, including the first open-domain, open-vocabulary language-based image editing algorithm, the first unified framework for language-guided and image-guided image synthesis, the first benchmark for compositional text-to-image synthesis, and an algorithm for semantic image synthesis. Finally, I will discuss my research plans for the future.