Abstract: From the moment we open our eyes, we are surrounded by people. By observing the people around us, we learn how to interact with them and the world. To create intelligent agents with similar capabilities, it is crucial to endow them with a perceptual system that can interpret and understand human behavior from visual observations. These observations are streams of two-dimensional images; however, the actual underlying state of humans is 4D: they have 3D bodies that move over time. In this talk, I will present my work on perceiving humans in 4D from video. This includes estimating their articulated 3D body pose, tracking them over time, and recovering a 4D reconstruction that is consistent with their spatial environment. I will highlight the limitations of systems that only operate in the space of image pixels and showcase the benefits of reasoning in 4D.