Bridging Vision and Language for Cross-Modal Understanding and Generation
While great progress has been made in both computer vision and natural language processing over the past decade, bridging vision and language remains a fundamental and challenging problem for advancing artificial intelligence. Vision is the most important way in which we perceive the world, while language carries high-level semantic information and abstract knowledge for communication and reasoning. Bridging these complementary modalities not only benefits representation learning in each modality, but also empowers future AI systems to unify perception, communication, reasoning, and creation abilities.
In this talk, I will introduce my research on vision-language cross-modal understanding and generation. Understanding cross-modal information will enable AI systems to process and learn from more complex information and to interact better with humans. I will first introduce my efforts on vision-language cross-modal understanding, including discriminative image captioning, visual grounding, text-image retrieval, and learning visual representations from language supervision. Beyond perception and understanding, the abilities to create and imagine are more advanced forms of intelligence. In particular, synthesizing images based on text instructions allows fine-grained and user-friendly control over visual content creation and editing. In the second part of my talk, I will introduce my contributions to vision-language cross-modal generation, including the first open-domain, open-vocabulary language-based image editing algorithm, the first unified framework for language-guided and image-guided image synthesis, the first benchmark for compositional text-to-image synthesis, and an algorithm for semantic image synthesis. Finally, I will discuss my research plans for the future.
Dr. Xihui Liu is a Postdoctoral Scholar at the Department of Electrical Engineering and Computer Sciences, UC Berkeley, advised by Prof. Trevor Darrell. Before that, she received her Ph.D. degree from the Department of Electronic Engineering at The Chinese University of Hong Kong and her B.Eng. degree from the Department of Electronic Engineering at Tsinghua University. Her research interests cover the broad areas of computer vision, natural language processing, machine learning, and artificial intelligence, with a special focus on the intersection of vision and language. She was awarded the Adobe Research Fellowship 2020 and was selected for MIT EECS Rising Stars 2021.