Workshop on
Natural Environments Tasks and Intelligence


Richard Andersen
Natural actions represented by neurons in the human posterior parietal cortex
Caltech

The posterior parietal cortex (PPC) is an area of the cerebral cortex that bridges sensory and motor areas and accomplishes sensory-to-motor transformations. In recent years we and others have demonstrated in recordings from animals that the PPC represents the initial intent for movements. Moreover, there is a map of intentions in PPC, with areas specialized for reaching, grasping and eye movements. This finding of intent signals in PPC has led us to investigate whether these signals could be used for neural prosthetic applications in humans. These new clinical studies are the first-ever long-term neural population recordings from the PPC in humans. We find that imagined-movement signals in PPC are ideally suited for many prosthetic applications, and the human studies have also allowed us to make new discoveries about how this area processes natural actions. These scientific findings are important for designing new prosthetic applications. Among the novel findings from neural population decoding in humans are the following: 1) The signals in PPC are surprisingly specific for imagined complex movements. 2) PPC represents both the goal of the movement and the trajectory of the movement. 3) Cells are selective for hand postures, including those that are not specific to grasping. 4) Both sides of the upper body are represented in PPC, but in an effector-dependent manner, and movements of multiple body parts can be decoded from the population. 5) The intent activity associated with planned movements occurs before the awareness of intent. 6) Many cells are selective for the observed natural actions of others. In conclusion, we have shown that a variety of natural action-related signals can be decoded from neural population activity in the PPC of humans. For prosthetic applications, we found that these cognitive signals provide fast, intuitive, and versatile control of robotic limbs and computer interfaces.

Matthias Bethge
Using natural image representations to predict where people look
Universität Tübingen

When free-viewing scenes, the first few fixations of human observers are driven in part by bottom-up attention. We seek to characterize this process by extracting all information from images that can be used to predict fixation densities (Kuemmerer et al., PNAS, 2015). If we ignore time and observer identity, the average amount of information is slightly larger than 2 bits per image for the MIT 1003 dataset. The minimum amount of information is 0.3 bits and the maximum is 5.2 bits. Before the rise of deep neural networks, the best models were able to capture 1/3 of this information on average. We developed new saliency algorithms based on high-performing convolutional neural networks such as AlexNet or VGG-19, which have been shown to provide generally useful representations of natural images. Using a transfer-learning paradigm, we first developed DeepGaze I, based on AlexNet, which captures 56% of the total information. Subsequently, we developed DeepGaze II, based on VGG-19, which captures 88% and is state of the art on the MIT 300 benchmark dataset. I will show best-case and worst-case examples as well as feature-selection methods to visualize which structures in the image are critical for predicting fixation densities.
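As a purely illustrative sketch (not the published DeepGaze code; the readout architecture and training details below are assumptions, not taken from the paper), the transfer-learning idea of predicting a fixation density from frozen VGG-19 features might look like this in PyTorch:

# Minimal sketch: a frozen pretrained VGG-19 feature extractor with a small
# 1x1-convolution readout whose spatial softmax is treated as a fixation
# density over image locations.
import torch
import torch.nn as nn
import torchvision.models as models

class FixationDensityReadout(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.features = vgg.features                 # frozen convolutional stack
        for p in self.features.parameters():
            p.requires_grad = False
        self.readout = nn.Conv2d(512, 1, kernel_size=1)  # small trainable readout

    def forward(self, image):                         # image: (B, 3, H, W)
        feats = self.features(image)                  # (B, 512, h, w)
        logits = self.readout(feats)                  # (B, 1, h, w)
        b, _, h, w = logits.shape
        # softmax over spatial positions -> a probability map (fixation density)
        return torch.softmax(logits.view(b, -1), dim=1).view(b, 1, h, w)

# Training would maximize the log-likelihood of recorded fixation locations
# under this density (i.e., minimize the negative log-probability assigned to
# the fixated pixels), which is also how the information captured by a model
# can be quantified in bits per image.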

Emily Cooper
What 3D scene statistics tell us about 3D vision
Dartmouth College

The study of natural image statistics has yielded substantial advances in our understanding of the organizing principles of early visual processing. Less is known about how these principles apply to 3D vision because of the technical challenges involved in measuring the relevant statistics. I will describe measurements of 3D statistics captured with a custom-built eye and scene tracking system. I will compare these measurements to perceptual and neurophysiological properties of 3D vision and consider the implications for how 3D information is extracted and interpreted by the visual system.
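As an illustration only (not the authors' pipeline; the depth values, fixation distance, and interocular distance below are placeholders), one way to turn measured depth maps into binocular-disparity statistics uses the standard small-angle approximation, disparity (radians) roughly equal to IPD * (1/Z_fixation - 1/Z_point):

# Minimal sketch: histogram of binocular disparities implied by a scene
# depth map and the tracked fixation distance. All values are placeholders.
import numpy as np

IPD = 0.064                      # interpupillary distance in meters (assumed)

def disparity_histogram(depth_map_m, fixation_distance_m, bins=101):
    """Histogram of disparities (deg) implied by a depth map, in a small-angle approximation."""
    z = depth_map_m[np.isfinite(depth_map_m) & (depth_map_m > 0)]
    disparity_rad = IPD * (1.0 / fixation_distance_m - 1.0 / z)
    disparity_deg = np.degrees(disparity_rad)
    counts, edges = np.histogram(disparity_deg, bins=bins, range=(-5, 5))
    return counts, edges

# Example with synthetic data standing in for eye/scene tracker output:
depth = np.random.uniform(0.5, 10.0, size=(480, 640))   # meters
counts, edges = disparity_histogram(depth, fixation_distance_m=1.0)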

Lawrence Cormack
Continuous tracking as an alternative to traditional psychophysics
The University of Texas at Austin

We recently described a novel framework for estimating visual sensitivity using a continuous target-tracking task in concert with a dynamic internal model of human visual performance (Bonnen et al., 2015). To collect data, observers simply follow a moving target using a mouse or, preferably, point at the target while their finger position is recorded by a suitable device, such as a Leap Motion Controller. This task is natural, fun, and produces stable estimates of sensitivity in a matter of minutes. We will show how the results from this paradigm compare to those obtained using traditional psychophysics. Further, we will describe some recent experiments revealing our relative sensitivity to depth vs. frontoparallel motion. Finally, some other potential applications of this technique will be mentioned.
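For illustration, one common summary of such tracking data (a simplification, not the Kalman-filter observer model of Bonnen et al., 2015) is the cross-correlogram between target and response velocities, whose peak lag and width index tracking performance; the simulated data below are placeholders:

# Minimal sketch: cross-correlogram between target velocity and response
# (mouse/finger) velocity from a continuous tracking session.
import numpy as np

def tracking_ccg(target_pos, response_pos, dt, max_lag_s=1.0):
    """Cross-correlation of target and response velocities as a function of lag (s)."""
    tv = np.diff(target_pos) / dt                 # target velocity
    rv = np.diff(response_pos) / dt               # response velocity
    tv = (tv - tv.mean()) / tv.std()
    rv = (rv - rv.mean()) / rv.std()
    max_lag = int(max_lag_s / dt)
    lags = np.arange(-max_lag, max_lag + 1)
    ccg = np.array([np.mean(tv[max(0, -k):len(tv) - max(0, k)] *
                            rv[max(0, k):len(rv) - max(0, -k)])
                    for k in lags])
    return lags * dt, ccg

# Example with a random-walk target and a delayed, noisy simulated "observer":
dt = 1 / 60.0
target = np.cumsum(np.random.randn(3600))
response = np.roll(target, 12) + 2.0 * np.random.randn(3600)   # ~200 ms lag
lag_s, ccg = tracking_ccg(target, response, dt)                # peak near +0.2 s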

Peter Dayan
Heuristics of control: habitization, fragmentation, memoization and pruning
University College London

Goal-directed or model-based control faces substantial computational challenges in deep planning problems. One popular solution to this is habitual or model-free control based on state-action values. In a simple planning task, we found evidence for three further heuristics: blinkered pruning of the decision-tree, the storage and direct reproduction of sequences of actions arising from search, and hierarchical decomposition. I will discuss models of these and their interactions. This is largely work by Quentin Huys in collaboration with Jon Roiser and Sam Gershman.
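A toy sketch (not the experimental task or the authors' models; the rewards, transitions, and threshold are invented) of how two of these heuristics, pruning and memoization, can be expressed in a simple depth-limited tree search:

# Minimal sketch: depth-limited planning with pruning of subtrees behind
# large immediate losses, and memoization of already-searched (state, depth)
# values. The task structure is hypothetical.
from functools import lru_cache

REWARDS = {0: {'a': -70, 'b': -20}, 1: {'a': 20, 'b': -20}, 2: {'a': -20, 'b': 20}}
NEXT_STATE = {0: {'a': 1, 'b': 2}, 1: {'a': 2, 'b': 0}, 2: {'a': 0, 'b': 1}}
PRUNE_THRESHOLD = -60      # subtrees behind losses this large are not searched

@lru_cache(maxsize=None)   # memoization: each (state, depth) is evaluated once
def value(state, depth):
    if depth == 0:
        return 0.0
    best = float('-inf')
    for action, r in REWARDS[state].items():
        if r <= PRUNE_THRESHOLD:
            continue                       # pruning: ignore the whole subtree
        best = max(best, r + value(NEXT_STATE[state][action], depth - 1))
    return best if best > float('-inf') else 0.0

print(value(0, depth=4))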

Greg DeAngelis
Neural computations for dissociating self-motion and object motion
University of Rochester

Image motion on the retina generally reflects some (unknown) combination of self-motion and the motion of objects in the world. Thus, a fundamental task for the brain is to dissociate visual motion into components related to self-motion and object motion. I will demonstrate that vestibular signals play important roles in perceptual estimation of both self-motion and object motion. In addition, I will show that decoding the responses of a population of multisensory (visual-vestibular) neurons allows the brain to accurately estimate self-motion (heading) in the presence of object motion, or vice versa. Moreover, the population decoding scheme that achieves this goal can be derived from a general computational strategy for marginalizing over one variable while estimating another. Thus, the brain appears to use multisensory signals to carry out near-optimal probabilistic computations that dissociate variables that may be confounded in the peripheral sensory input.
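To make the marginalization step concrete, here is a minimal numerical sketch (the grid-based likelihood is a placeholder, not a model of actual neural responses): the posterior over heading is obtained by summing the joint posterior over the nuisance object-velocity variable:

# Minimal sketch: estimate heading in the presence of object motion by
# marginalizing a joint posterior over the nuisance variable.
import numpy as np

headings = np.linspace(-90, 90, 181)          # candidate headings (deg)
object_vels = np.linspace(-10, 10, 41)        # candidate object velocities

def joint_likelihood(responses, headings, object_vels):
    """Placeholder p(responses | heading, object velocity) on a grid."""
    H, V = np.meshgrid(headings, object_vels, indexing='ij')
    true_h, true_v = responses                # stand-in for decoded evidence
    return np.exp(-0.5 * (((H - true_h) / 10) ** 2 + ((V - true_v) / 3) ** 2))

like = joint_likelihood((20.0, -4.0), headings, object_vels)
prior = np.ones_like(like)                    # flat prior for illustration
posterior = like * prior
posterior /= posterior.sum()

# Marginalize: sum over object velocity to obtain the posterior over heading.
heading_posterior = posterior.sum(axis=1)
heading_estimate = headings[np.argmax(heading_posterior)]   # ~20 deg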

Jim DiCarlo
Neural mechanisms underlying visual object recognition
McGovern Institute for Brain Research, MIT

Visual object recognition is a fundamental building block of memory and cognition, and a central problem in systems neuroscience, human psychophysics, and computer vision. The computational crux of visual object recognition is that the recognition system must somehow be robust to the tremendous range of images produced by each object and by the range of objects that belong to the same category. The brain's solution to this "invariance" problem is thought to be conveyed by the neural outputs at the top of the ventral visual processing stream -- the inferior temporal cortex (IT). To move the field from phenomenology to model-based understanding, we are testing falsifiable, mechanistic hypotheses of how IT cortex underlies object recognition behavior. The current leading hypothesis framework - inspired by previous work of several labs - is that IT cortex provides a neuronal population basis for simple, rapid learning of new objects by downstream neurons. Working within this framework, we have recently found that a simple downstream learning rule operating on IT neuronal population rate codes accurately predicts both the pattern and magnitude of behavioral performance over a large range of visual object recognition tasks. The predictions are so accurate as to be statistically perfect (i.e., indistinguishable within the variability of human behavior). I will show methodological proof-of-principle for our next test of this hypothesis -- direct causal manipulation (optogenetic and pharmacological) of those IT rate codes and the predicted changes in object discrimination behavior. But what are the neural mechanisms that produce that IT population basis from the visual image? By using machine learning methods to search a set of biologically constrained neural network architectures for those that have high object recognition performance, we have built populations of model "neurons" that are very good predictors of the response properties of IT neurons, even though these models were not optimized to fit those neural responses directly. Similarly, we found that "neurons" in intermediate layers of these models are also very good predictors of responses in intermediate ventral stream areas (V4). This suggests that these networks contain key neural mechanisms that produce the IT population basis and its support of human object recognition - mechanisms we now aim to decipher.
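A minimal sketch, with entirely synthetic data, of the kind of analysis implied by a simple downstream readout of IT population rate codes: a cross-validated linear classifier trained on population rate vectors, whose discrimination accuracy can then be compared against behavior (the data generation and the scikit-learn classifier are assumptions for illustration, not the labs' pipeline):

# Minimal sketch: linear readout of (synthetic) population rate vectors for
# a two-way object discrimination task.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_neurons, n_trials = 200, 400

# Synthetic "IT" rates: two object classes with slightly different mean
# population responses, corrupted by trial-to-trial variability.
labels = rng.integers(0, 2, n_trials)
class_means = rng.normal(0, 1, (2, n_neurons))
rates = class_means[labels] + rng.normal(0, 3, (n_trials, n_neurons))

readout = LogisticRegression(max_iter=1000)
accuracy = cross_val_score(readout, rates, labels, cv=5).mean()
print(f"cross-validated discrimination accuracy: {accuracy:.2f}")
# In the actual studies, such task-by-task accuracies are compared against
# the pattern and magnitude of human and monkey behavioral performance.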

Alex Huk
Selection and integration of sensory evidence during continuous naturalistic behaviors
The University of Texas at Austin

Everyday life requires dynamic matching of behavior to the tasks at hand. Depending on the task, very different portions of the information in the visual field should be used. This is often studied with static attentional paradigms that require a single discrete response on each trial and, typically, significant training. However, this does not match natural task-dependent behavior well, where both the goal and the distribution of information can change continuously. Thus, very little is known about how the brain accomplishes task-dependent selection and integration of information in more naturalistic continuous behavior. We therefore developed a novel paradigm that allows quantification of the temporal and spatial visual parameters that drive behavior in a continuous ocular tracking task, and that is also suitable for characterizing motion-selective neurons. Here we report behavioral results from this paradigm. The core stimulus consisted of a moving cloud of dots creating large-field optic flow (80 x 50 deg of visual angle). The focus of expansion (FOE) of the flow field moved continuously according to a random walk. To characterize the spatial integration of the visual motion, we divided the field into hexagonal subfields. Each subfield was either blank or carried a small, random perturbation of the motion associated with the "true" FOE. Both macaques and humans intuitively tracked the FOE. We used regression to determine the spatiotemporal parameters that best predicted the subject's gaze. Gaze was most influenced by the FOE location from about 250 ms earlier, with the majority of temporal information integrated by 800 ms but some influence lasting up to 2 s. Spatial integration was confined to ±4 deg around the gaze location. When another 45% of the subfields were consistent with the motion of a second, independent FOE (with differently colored dots), subjects could still track the FOE of one flow field with a similar profile of temporal integration. The paradigm provides an intuitive and easy-to-learn framework that can reveal the spatiotemporal profiles of motion integration during continuous behavior, and it can be extended to quantitatively characterize the spatiotemporal dynamics of selection without requiring extensive training in threshold-level psychophysics.
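As an illustration of the regression analysis described above (the simulated data and ridge penalty are placeholders, not the study's settings), gaze position can be regressed onto time-lagged FOE positions to recover a temporal integration kernel:

# Minimal sketch: ridge regression of gaze position onto lagged FOE
# positions; the estimated kernel shows which past FOE locations drive gaze.
import numpy as np

dt, n = 1 / 100.0, 20000                      # 100 Hz samples, 200 s of tracking
max_lag = int(1.0 / dt)                       # kernel covering 1 s of history

rng = np.random.default_rng(1)
foe = np.cumsum(rng.normal(0, 0.1, n))        # random-walk FOE position (deg)

# Simulated gaze: a delayed, smoothed copy of the FOE plus noise (~250 ms lag).
true_kernel = np.exp(-(np.arange(max_lag) - 25) ** 2 / (2 * 8 ** 2))
gaze = np.convolve(foe, true_kernel / true_kernel.sum(), mode='full')[:n]
gaze += rng.normal(0, 0.2, n)

# Design matrix of lagged FOE values: row t holds foe[t], foe[t-1], ...
X = np.stack([np.roll(foe, lag) for lag in range(max_lag)], axis=1)[max_lag:]
y = gaze[max_lag:]

ridge = 10.0
kernel = np.linalg.solve(X.T @ X + ridge * np.eye(max_lag), X.T @ y)
# kernel[k] estimates the influence of the FOE location k*dt seconds earlier.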

Richard Murray
Lighting, lightness, and shape
York University

Lighting, lightness, and 3D shape perception are difficult tasks that are intrinsically related, because any retinal image could have been generated by a wide range of combinations of lighting conditions, surface colours, and surface shapes. This suggests that any visual system that can perceive these properties must make assumptions, at least implicitly, about the statistical distribution of lights, colours, and shapes in the real world. I will describe three experiments that examine what assumptions human vision makes about these properties. The first experiment examines the light-from-above prior, the human visual system's assumption that light tends to come from overhead. In psychophysical experiments I put the light-from-above prior in conflict with cues such as shading and shadows that indicate the true lighting direction in a scene. The results show that the light-from-above prior is astonishingly weak, and can be overridden by minuscule, almost imperceptible amounts of information about the true lighting direction in a scene. The second experiment uses novel measurements from a custom-built multidirectional photometer to examine the diffuseness of natural lighting. These measurements show that natural lighting is typically much more diffuse than we might have expected, and furthermore that the idea that human vision is attuned to diffuse natural lighting conditions can explain some otherwise mysterious errors in lightness perception. The third experiment shows that people's ability to detect glowing surfaces, i.e., light sources, is quite sophisticated and takes into account the relationship between image luminance and 3D shape: vivid percepts of glow can be toggled on and off simply by modulating subtle cues to depth. Together these experiments show that a Bayesian natural-statistics framework is a promising way of understanding human perception of lighting, lightness, and shape.

Peter Neri
Image interpretation controls signal reconstruction from natural scenes
Ecole Normale Supérieure

Early visual cortex represents scenes as spatially organized maps of locally defined features, such as edges and lines. As image reconstruction unfolds and features are assembled into larger constructs, cortex attempts to recover semantic content for object recognition; the evolving interpretation of the image is then fed back to the feature-extraction stage and may impact its operation. Although we know that feature extraction operates alongside image interpretation, it is not known exactly how these processes affect each other at the level of a fully specified computational model. I will review work on feature extraction as an isolated process, as well as its operation under instruction from the image-interpretation module via semantic content and scene segmentation. Collectively, the results indicate that the human sensory process must be viewed as a multidirectional system in which modules at both ends of the spectrum, from higher- to lower-level representations, interact with and inform each other in a cohesive manner within a highly integrated architecture. The exact manner in which these interactions occur is unclear at this stage and may not be easily summarized as a one-size-fits-all operation. I will explore potential avenues for making progress toward clarifying these interactions.

Nicholas Priebe
Binocular integration in mice
The University of Texas at Austin

In mammals, binocular integration in primary visual cortex (V1) is an important first step in the development of stereopsis, the perception of depth from disparity. Neurons in V1 receive inputs from both eyes, but it is unclear how that binocular information is integrated. Using a combination of extracellular recordings and calcium imaging in the binocular zone of mouse V1, we demonstrate that mouse V1 neurons are tuned for binocular disparities, or spatial differences between the inputs from the two eyes, and thus extract a signal useful for estimating depth. The disparities encoded by mouse V1 are significantly larger than those encoded by cats and primates, but they correspond to distances that are likely to be ecologically relevant in natural viewing, given the stereo-geometry of the mouse visual system. To shed light on how disparity information emerges from the cortical network, we measured the response selectivity of genetically defined subpopulations of excitatory and inhibitory neurons. Excitatory neurons exhibit strong selectivity for binocular disparity, but, as with orientation preference in mouse V1, no functional organization for disparity preference is evident across the cortical map. In contrast, one set of inhibitory neurons, parvalbumin-expressing (PV+) interneurons, receives strong inputs from both eyes but lacks selectivity for binocular disparity. Their broad selectivity for disparity is related to the degree of functional diversity in the local neuronal network. Finally, because there has been no evidence of mice using binocular vision to guide behavior, we have trained mice to perform discrimination tasks in which binocular integration must be used to guide appropriate behavior. We find not only that mice can integrate inputs from both eyes for behavior, but also that their behavior in these tasks depends on activity in visual cortex. It therefore appears that binocular integration is a common cortical computation used to extract information relevant for estimating depth across mammals. As such, it is a prime example of how the integration of multiple sensory signals is used to generate accurate estimates of properties in our environment.
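For illustration only (not necessarily the authors' analysis; the synthetic responses are placeholders), disparity tuning of this kind is often characterized by fitting a Gabor function of disparity to a neuron's mean responses:

# Minimal sketch: fit a Gabor tuning curve (baseline + Gaussian envelope
# times cosine carrier) to simulated disparity responses.
import numpy as np
from scipy.optimize import curve_fit

def disparity_gabor(d, r0, amp, d0, sigma, freq, phase):
    """Gabor function of disparity d."""
    return r0 + amp * np.exp(-(d - d0) ** 2 / (2 * sigma ** 2)) \
              * np.cos(2 * np.pi * freq * (d - d0) + phase)

disparities = np.linspace(-20, 20, 17)        # deg; mouse disparities are large
true = disparity_gabor(disparities, 5, 10, 2, 8, 0.03, 0.5)
responses = true + np.random.default_rng(2).normal(0, 1, disparities.size)

p0 = [5, 10, 0, 8, 0.03, 0]                   # initial parameter guess
params, _ = curve_fit(disparity_gabor, disparities, responses, p0=p0,
                      maxfev=10000)
preferred_disparity = params[2]               # envelope center of the fit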

Eyal Seidemann
Linking single cortical neurons, local population responses in topographic maps, and perception
The University of Texas at Austin

In this talk I will argue that in order to understand the role of a given sensory cortical area in perception, we need to (1) study neural activity in behaving subjects using complementary techniques that measure neural responses at multiple spatial scales, and (2) develop computational tools for understanding the links between these measurements. These efforts are important for two main reasons. First, some aspects of perception may be represented by a distributed population code, where information is encoded by the spatial pattern of activity within a functional topographic map rather than by the activity of individual neurons. In the first part of this talk I will describe a study that combines computational modeling, imaging of population responses in behaving monkeys and behavioral measurements in humans, to test the hypothesis that the spread of population activity in V1's topographic map of visual space contributes to shape perception. Second, current techniques for measuring neural activity in behaving subjects suffer from a fundamental tradeoff between resolution and coverage, such that, even in a simple perceptual task with a localized stimulus, no single technique can measure the activity of all potentially relevant neurons in a given cortical area with cellular resolution. In the second part of the talk I will describe a new method for optical imaging of neural responses in the macaque cortex using genetically encoded calcium indicators. We use this technique to measure visual responses over multiple square millimeters with columnar resolution. Finally, I will describe our initial attempts to develop a simple computational model that links the activity of single V1 neurons to the signals measured at larger spatial scales with population imaging.
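A minimal sketch of the sort of linking model mentioned in the final sentence (all numbers are placeholders): treat the imaged population signal as a spatially weighted sum of underlying single-neuron activity, i.e., a convolution of the activity map with a pooling kernel:

# Minimal sketch: predict a large-scale population-imaging signal from a map
# of single-neuron activity via a spatial pooling kernel.
import numpy as np
from scipy.ndimage import gaussian_filter

cortex_mm = 4.0                                # imaged patch, 4 x 4 mm (assumed)
grid = 200                                     # pixels per side
px_per_mm = grid / cortex_mm

# Placeholder single-neuron activity: a compact hotspot driven by a small
# visual stimulus, already mapped into cortical coordinates.
neural_map = np.zeros((grid, grid))
neural_map[100, 100] = 1.0

# Pooling kernel standing in for lateral spread plus optical blur (sigma in mm).
sigma_mm = 0.4
imaging_signal = gaussian_filter(neural_map, sigma=sigma_mm * px_per_mm)
# Comparing such predicted images against measured population responses is
# one way to constrain how single-neuron activity relates to the signals
# measured at larger spatial scales.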

Eero Simoncelli
Embedding of prior probabilities in neural populations
New York University

The mammalian brain is a metabolically expensive device, and evolutionary pressures have presumably driven it to make productive use of its resources. For early stages of sensory processing, this concept can be expressed more formally as an optimality principle: the brain maximizes the information that is encoded about relevant sensory variables, given available resources. I'll describe a specific instantiation of this hypothesis that predicts a direct relationship between the distribution of sensory attributes encountered in the environment, and the selectivity and response levels of neurons within a population that encodes those attributes. This allocation of neural resources, in turn, imposes direct limitations on the ability of the organism to discriminate different values of the encoded attribute. I'll show that these physiological and perceptual predictions are borne out in a variety of visual and auditory attributes. Finally, I'll show that this encoding of sensory information provides a natural substrate for subsequent computation (in particular, Bayesian estimation), which can make use of the knowledge of environmental (prior) distributions that is embedded in the population structure.
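One concrete, illustrative instantiation of the allocation principle (a sketch, not the published derivation; the Gaussian prior and tuning shapes are assumptions): place the preferred values of a fixed budget of tuning curves at equal-probability quantiles of the environmental prior, so that commonly occurring stimulus values receive more neurons and are discriminated more finely:

# Minimal sketch: allocate a fixed population of tuning curves according to
# an environmental prior over the encoded attribute.
import numpy as np
from scipy.stats import norm

n_neurons = 30
prior = norm(loc=0.0, scale=1.0)              # placeholder environmental prior

# Preferred values at the prior's quantiles: dense where the prior is high.
quantiles = (np.arange(n_neurons) + 0.5) / n_neurons
preferred = prior.ppf(quantiles)

# Tuning widths inversely related to local prior density, so each neuron
# covers roughly equal probability mass.
widths = 1.0 / (n_neurons * prior.pdf(preferred))

def population_response(stimulus):
    """Gaussian tuning curves evaluated at a single stimulus value."""
    return np.exp(-0.5 * ((stimulus - preferred) / widths) ** 2)

resp = population_response(0.5)
# Discrimination thresholds predicted by such a code scale inversely with the
# prior density: common attribute values are discriminated more finely, and
# the prior itself is recoverable from the population structure.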

Stefan Treue
Investigating the multitude of attentional influences on the neural representation of visual motion in primate extrastriate cortex
German Primate Center

Area MT is arguably the best-understood area of primate extrastriate visual cortex in terms of its representation of the incoming (bottom-up) sensory information. MT is considered to be of critical importance for our ability to perceive the visual motion patterns in our environment. This level of understanding of the neural representation of sensory information in one cortical area is an excellent basis for investigating the top-down influences exerted by various types of attention on MT responses. The talk will give an overview of the multitude of attentional effects that have been discovered with this focused approach, ranging from effects of spatial, feature-based and object-based attention on the encoding of target and distractor stimuli, to multiplicative and non-multiplicative modulations of tuning curves, perisaccadic effects, effects on the representation of change events in the environment, and neural responses that are not modulated by attention. From these investigations a clear pattern emerges that turns MT into a model area for the interaction of sensory (bottom-up) signals with cognitive (top-down) modulatory influences that characterizes visual perception. These findings also document how this interaction enables visual cortex to actively generate a neural representation of the environment that combines the high-performance sensory periphery with selective modulatory influences to produce an "integrated saliency map" of the environment.
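As a toy illustration of the distinction drawn above between multiplicative and non-multiplicative modulations of tuning curves (all parameter values are invented):

# Minimal sketch: a direction tuning curve under a multiplicative gain change
# versus two non-multiplicative modulations (an additive offset and a width
# change).
import numpy as np

directions = np.linspace(-180, 180, 361)      # motion direction (deg)

def tuning(d, preferred=0.0, width=40.0, baseline=5.0, amplitude=30.0):
    return baseline + amplitude * np.exp(-0.5 * ((d - preferred) / width) ** 2)

unattended = tuning(directions)
multiplicative = 1.3 * unattended              # gain change: same shape, scaled
additive = unattended + 6.0                    # non-multiplicative: offset added
sharpened = tuning(directions, width=32.0)     # non-multiplicative: narrower tuning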

Chen Yu
Active Vision: What head-mounted eye tracking reveals about infants' active visual exploration
Indiana University

Visual information plays a critical role in early learning and development as infants accumulate knowledge by exploring the visual environment. Beyond the earliest stages of infancy, young children are not merely passive lookers but also active doers. One of the first and most vitally informative types of actions infants take involves the self-control of their looking behaviors to visually explore the world. So-called active vision is key to the goal-directed selection of information to process. The recent emergence of lightweight, wearable cameras allows us to collect vast volumes of egocentric video data and to record infants' moment-by-moment visual attention as they engage in various tasks in the real world. Using this new technology, recent research in my lab focuses on examining the structure of children's dynamic visual experiences during active participation in a physical and social world. In this talk, I will present several studies showing how visual information serves a wide range of natural tasks, from guiding motor action to learning about visual objects and interacting with social partners.