Workshop on
Natural Environments Tasks and Intelligence

Multisensory integration for action in natural and virtual environments

Heinrich H. Bülthoff

Many experiments that study the mechanisms by which different senses interact in humans focus on perception. In most natural tasks, however, sensory signals are not ultimately used for perception, but rather for action. The effects of the action are sensed again by the sensory system, so that perception and action are complementary parts of a dynamic control system. In our cybernetics research group at the Max Planck Institute in Tuebingen, we use psychophysical, physiological, modeling and simulation techniques to study how cues from different sensory modalities are integrated by the brain to perceive and act in the real world. In psychophysical studies, we have shown that humans often, but not always, integrate multimodal sensory information in a statistically optimal way, such that cues are weighted according to their reliability. In this talk I will also present our latest simulator technology using an omni-directional treadmill and a new type of flight simulator based on an anthropomorphic robot arm.
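As an illustration of what "weighted according to their reliability" can mean, the following is a minimal sketch of inverse-variance cue weighting, assuming independent Gaussian noise on each cue. The function name and the example numbers are hypothetical and are not taken from the talk.

```python
def integrate_cues(estimates, variances):
    """Combine independent Gaussian cue estimates by inverse-variance weighting.

    Assumption (not from the abstract): each cue is unbiased with known variance.
    Returns the combined estimate, its variance, and the weight given to each cue.
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one variance per estimate")
    reliabilities = [1.0 / v for v in variances]  # reliability = 1 / variance
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]  # weights sum to 1
    combined = sum(w * e for w, e in zip(weights, estimates))
    combined_variance = 1.0 / total  # never larger than the best single cue
    return combined, combined_variance, weights


# Hypothetical example: two cues to the same quantity (values are made up).
combined, variance, weights = integrate_cues([10.0, 12.0], [1.0, 4.0])
```

In this sketch the more reliable cue (variance 1.0) receives the larger weight, and the combined variance is smaller than that of either cue alone.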

Disparity statistics in natural scenes

Lawrence Cormack

Binocular disparity is the input to stereopsis, which is a very strong depth cue in humans. However, the distribution of binocular disparities in natural environments has not been quantitatively measured. In this study, we converted distances from accurate range maps of forest scenes and indoor scenes into the disparities that an observer would encounter, given an eye model and fixation distances (which we measured for the forest environment, and simulated for the indoor environment). We found that the distributions of natural disparities in these two kinds of scenes are centered at zero, have high peaks, and span about 5 degrees, which closely matches the macaque MT cells' disparity tuning range. These ranges are fully within the operational range of human stereopsis determined psychophysically. Suprathreshold disparities (>10 arcsec) are common rather than exceptional. There is a prevailing notion that stereopsis only operates within a few meters, but our finding suggests that we should rethink the role of stereopsis at far viewing distances because of the abundance of suprathreshold disparities.
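The conversion from distance to disparity can be sketched as follows. This is a simplified illustration, not the eye model used in the study: it assumes points on the midline straight ahead of the observer and a hypothetical interocular distance of 0.065 m, and it adopts the convention that points nearer than fixation have positive disparity.

```python
import math


def vergence_angle(distance_m, interocular_m=0.065):
    """Angle (radians) subtended at a midline point by the two eyes.

    interocular_m is an assumed value, not one reported in the abstract.
    """
    return 2.0 * math.atan(interocular_m / (2.0 * distance_m))


def disparity_arcsec(distance_m, fixation_m, interocular_m=0.065):
    """Disparity of a midline point relative to the fixation distance, in arcsec.

    Positive for points nearer than fixation, negative for farther points.
    """
    delta_rad = (vergence_angle(distance_m, interocular_m)
                 - vergence_angle(fixation_m, interocular_m))
    return math.degrees(delta_rad) * 3600.0


# Hypothetical far-viewing case: fixating at 20 m, a second point at 25 m.
far_disparity = disparity_arcsec(25.0, 20.0)
```

Under these assumptions, a depth step of a few meters at a viewing distance of 20 m still produces a disparity well above 10 arcsec in magnitude, which is the kind of case the abstract refers to.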

Scene statistics and prior expectations in depth perception

Marty Banks

Depth cues have been classified as ordinal, relative, and absolute depending on their underlying geometry. In this classification, occlusion is an ordinal cue because it specifies whether one surface is nearer or farther than another. Similarly, binocular disparity (horizontal disparity in particular) is a relative cue because it specifies the relative distance from one surface to another. An alternative is to consider the statistical relationship between cue values and the distribution of depths in the natural environment. Perhaps the visual system has incorporated those statistics, along with prior expectations, to determine most likely depths. I will present work on the combination of two depth cues--shape of an occluding contour and binocular disparity--and will describe the statistically optimal way of combining those cues. Experimental results show that human viewers behave in a fashion rather similar to this optimal combination. I will discuss extensions of this approach to other depth cues and a more general framework for describing human depth perception.

Natural image statistics and the trichromatic cone mosaic

David Brainard

Human color vision is mediated by three classes of cones, each characterized by its own spectral sensitivity. It is this biological fact that underlies behavioral trichromacy, so that essentially any light may be matched by a mixture of three primaries. The human cone mosaic, however, has an interleaved structure, so that at each location there is only one cone. Thus the trichromacy observed for spatially extended stimuli must result from a combination of color information over space. I present a Bayesian calculation that models this integration of information from individual cones. The calculation accounts for a variety of perceptual phenomena. In particular, it can be elaborated into a quantitative model for the appearance of very small monochromatic spots. Empirically, observers provide a wide range of color names in response to spots with a retinal size comparable to that of a single cone (achieved through the use of adaptive optics, Hofer et al., 2005). To model this, I start with the simulated responses of the individual L-, M-, and S-cones actually present in the cone mosaic and use the Bayesian calculation to estimate the trichromatic L-, M-, and S-cone signals that were present at every image location. The calculation incorporates precise measurements of the optics and chromatic topography of the cone mosaic in individual observers, as well as the spatio-chromatic statistics of natural images. I carefully simulated the experimental procedures of Hofer et al., and predicted the color name on each simulated trial from the average chromaticity of the LMS image estimated by the calculation. There were no free parameters to describe individual observers, but nonetheless the striking individual variation in naming emerged naturally as a consequence of the measured individual variation in the relative numbers and arrangement of L-, M- and S-cones. The model also makes testable predictions for experiments that may soon be feasible, including how color naming should vary with spot size and with the fine structure of the retinal mosaic.

The role of visual short-term memory in planning goal-directed hand movements

David Knill

When performing complex tasks involving sequences of goal-directed hand movements, humans have the opportunity to use both on-line visual information and visual short-term memory (VSTM) about target objects to plan movements. I will describe research designed to investigate the role of visual short-term memory in planning hand movements in a natural situation in which on-line visual information is available. Using a sequential movement task in a virtual reality environment in which we strategically perturb the positions of target objects prior to movement planning, we have been able to measure the relative weights that subjects give to on-line and remembered information about target position for planning movements. Not only do we find that subjects give a significant weight to remembered information about visual position, but they do so in a statistically optimal way: when we degrade the on-line visual information (by reducing target contrast), subjects increase the weight that they give to memory for planning movements to the object by an amount consistent with the reduced acuity for object position. I will discuss the results in the context of a model for integrating information over time and across saccades to outline a number of critical questions that we can ask about the role of VSTM in sensorimotor planning during natural behavior.

The Interactive Routine as Key Construct in Theories of Interactive Behavior

Wayne Gray

Somewhere between 1/3 and 3 seconds, the cognitive, perceptual, and motor elements of embodied cognition come together in dependency networks of constraints to form interactive routines (Gray & Boehm-Davis, 2000; Gray, Sims, Fu, & Schoelles, 2006). Interactive behavior proceeds by selecting one interactive routine after another or by selecting a stable sequence of interactive routines (i.e., a method) to accomplish a unit task (Card, Moran, & Newell, 1983). Adopting Ballard's (Ballard, Hayhoe, Pook, & Rao, 1997) analysis of embodiment, we see these interactive routines as the basic elements of embodied cognition.

The selection, assembly, and execution of interactive routines is typically non-deliberate and non-conscious. The contrast is between deliberative actions performed for some purpose of their own versus routine actions performed as a means rather than as an end. In part this is a level-of-analysis distinction. For example, if you balance your checkbook each month using personal finance software, then the goal of balancing your checkbook is deliberate, but the eye movements, mouse movements, and retrieval of the temporarily stored location of the current entry and numerical information are not. I will argue that interactive routines should be viewed as the key construct in theories of interactive behavior. To this end, I will review and discuss data that have been collected by many researchers over many years.

Intelligent control of complex systems: How brains do it and how engineers could do it

Emo Todorov

How does the sensorimotor system map sensory signals into motor commands? How can we repair this feedback loop when it breaks, or recreate it in robotic systems? Recent evidence suggests that the brain controls the body in the best way possible. This can be formalized in the framework of stochastic optimal control. The resulting models have unique predictive power. Given parsimonious assumptions in the form of cost functions, they yield elaborate and surprisingly accurate predictions regarding motor behavior as well as neural representations. Apart from illuminating principles of neural information processing, this approach holds promise for control of complex neuro-prosthetic devices as well as autonomous robots. It leverages advances in numerical optimization and computing hardware, automatically synthesizing control laws for otherwise intractable problems. The required numerical optimization is still very challenging, yet the brain somehow manages to do it at least approximately. We have developed efficient new methods for stochastic optimal control which exploit previously unknown properties of optimal feedback control laws. These include hierarchical control via sensorimotor synergies, as well as a new problem formulation which reduces stochastic optimal control to a linear problem. Our methods extend the range of problems that can be tackled in practice. This opens up exciting possibilities for applications in biomedical and control engineering, and enables us to construct more sophisticated models of sensorimotor function.

Human Capacity and Fidelity of Natural Image Representation

Aude Oliva

The human visual system has been extensively trained to deal with millions of natural images, giving it the opportunity to develop robust strategies to quickly identify novel input and remember familiar objects and scenes. Although it is known that the human memory capacity for images is massive, the fidelity with which the human system can represent such a large number of images is unknown. We conducted three large-scale experiments to determine the capacity of human memory for visual detail, by systematically varying the amount of detail required to succeed in subsequent memory tests. Contrary to the commonly accepted view that natural image representations contain only the gist of what was seen, our results show that human memory is able to store an incredibly large number of different visual images with a large amount of visual detail per item. These results present a great challenge to neural and computational models of object and natural image recognition, which must be able to account for such a large and detailed storage capacity. Work in collaboration with Talia Konkle, Timothy Brady and George Alvarez.

Information Scaling and Manifolds Learning in Natural Images and Video

Song-Chun Zhu

Studies of natural image statistics have mostly focused on marginal observations such as histograms and co-occurrence matrices. In this talk, I will present some recent progress in studying the manifolds in high-dimensional spaces of image (video) patches. It is evident that there are two types of manifolds: explicit manifolds of low dimensions for image structures/primitives and implicit manifolds of high dimensions for textures. There is also a spectrum of other manifolds that mix the two types in the universe of image patches, and these manifolds are connected through information scaling. This observation leads to a mathematical model that integrates Markov random fields and sparse coding for the primal sketch representation conjectured by David Marr. I will also present some manifolds in video patches, which expand in the dimensions of lighting and motion.

Eye movements and actions: knowing where to look

Michael F. Land

For most of the actions we perform we need visual information to guide our limbs, and to get this we must first direct our gaze to the source of this information. Sometimes this is straightforward: we look at the site of the action. Often, however, this is not the case. When hitting a table tennis ball we direct gaze to the bounce point before the ball has reached it. When playing piano music gaze alternates between the two staves, but the motor output comes from the fingers about a second later. When drawing a portrait we extract some feature of the model, store it and subsequently reproduce it on the sketch. When preparing a meal we must locate the utensils we need before they can be put to use. In all these cases the gaze control system is proactive: it is not simply responding to immediate requests from the motor system for information, nor to the salience of objects in the surroundings. In this talk I will discuss different types of gaze control, both in relation to the way the visual system supplies the motor system with information, and to the descending executive commands that specify the nature of the tasks the motor system has to execute.

Bayesian decision theory as a framework for perception and action

Laurence T. Maloney

Bayesian decision theory is a method for computing optimal decision rules given a prior distribution, a likelihood function, and a loss function. A remarkable amount of perception research in the past 25 years can readily be modeled within the Bayesian framework, and yet there has been no systematic attempt to test the framework itself, and it is sometimes claimed that it is too encompassing to ever be disproved. I’ll describe ways of testing the components of Bayesian decision theory experimentally and illustrate how to test its applicability to modeling optimal speeded movement that takes into account the subject’s own temporal motor uncertainty. In the experiment described, subjects attempted to touch small targets that abruptly appeared on a screen in front of them. If they touched a target, they earned money, but the amount of money earned decreased rapidly over time. If they moved too quickly, they would likely miss the target; if they moved too slowly, they would hit it but earn little reward. The problem for the subject is to base their choice of movement plan on their own speed-accuracy tradeoff. In this task and others I will describe, we manipulate loss functions as independent variables and thereby test the Bayesian framework. Subjects’ performance was close to the performance that maximizes expected gain as predicted by the model based on the framework.
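The tradeoff in the experiment described can be illustrated with a toy calculation of expected gain. This is only a sketch under assumed forms, not the model from the talk: endpoint scatter is taken to shrink in inverse proportion to movement duration, the target is a one-dimensional interval, the reward decays exponentially with time, and all parameter values are hypothetical.

```python
import math


def hit_probability(duration_s, half_width_mm=5.0, noise_scale=2.0):
    """Probability that a Gaussian-distributed endpoint lands on the target.

    Assumed speed-accuracy tradeoff: endpoint SD (mm) = noise_scale / duration.
    """
    sigma_mm = noise_scale / duration_s
    return math.erf(half_width_mm / (sigma_mm * math.sqrt(2.0)))


def reward(duration_s, max_reward=1.0, decay_s=0.3):
    """Assumed reward schedule: payoff decays exponentially with elapsed time."""
    return max_reward * math.exp(-duration_s / decay_s)


def expected_gain(duration_s):
    """Expected gain = probability of a hit times the reward earned for a hit."""
    return hit_probability(duration_s) * reward(duration_s)


def best_duration(candidates):
    """Pick the candidate movement duration with the highest expected gain."""
    return max(candidates, key=expected_gain)


# Grid of candidate movement durations from 0.05 s to 1.00 s in 10 ms steps.
candidates = [0.05 + 0.01 * i for i in range(96)]
t_star = best_duration(candidates)
```

With these assumed forms, very fast movements mostly miss and very slow movements earn little, so the duration that maximizes expected gain lies strictly between the two extremes. Changing the reward function changes the predicted optimum, which is the sense in which a loss function can be treated as an independent variable.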

Learning 'what' and 'where' decompositions of time-varying natural images via amplitude and phase decomposition

Bruno Olshausen

There is now much theoretical and experimental support for the idea of feature selectivity and sparse coding in visual and other sensory cortical areas. At the same time, it is widely believed that higher cortical areas are involved in representing invariances of the environment, and "slowness" has been proposed as a potential coding objective for achieving this. Here I will discuss the relation between sparseness and slowness, and our efforts to incorporate these two objectives into a hierarchical, generative model of time-varying natural images. We show that when images are decomposed in terms of local amplitude and phase, and higher levels are trained to extract 'sparse and slow' components, it is possible to learn higher-order properties of images corresponding to form and motion. The model raises a number of intriguing questions regarding the nature of cortical representation that can be tested experimentally.

Quantitative Modeling of Gaze control in Natural Tasks

Laurent Itti

Visual processing of complex natural environments requires animals to combine, in a highly dynamic and adaptive manner, sensory signals that originate from the environment (bottom-up) with behavioral goals and priorities dictated by the task at hand (top-down). In the visual domain, bottom-up and top-down guidance of attention towards salient or behaviorally relevant targets have both been studied and modeled extensively. More recently, the interaction between bottom-up and top-down control of attention has also become a topic of interest. A number of neurally-inspired computational models have emerged which integrate components for the computation of bottom-up salience maps, top-down attention biasing, rapid computation of the "gist" or rough context of a scene, object recognition, and some higher-level cognitive reasoning functions. I will review a number of such efforts, which aim at building models that can both process real-world inputs in robust and flexible ways, and perform cognitive reasoning on the symbols extracted from these inputs. I will draw from examples in the biological/computer vision fields, including algorithms for complex scene understanding, robot navigation, and animation of virtual humans.

Factors controlling allocation of gaze in dynamic environments

Mary Hayhoe

Over the past five to ten years, significant advances have been made in our knowledge of eye movement patterns in the context of a variety of natural behaviors. Despite these advances, we lack understanding of the control mechanisms responsible for the generation of sequences of eye movements that occur in the context of ongoing visually guided behavior. Attempts to explain gaze patterns have almost exclusively concerned only static, restricted stimulus conditions. Such models are unlikely to extend to dynamic natural behavior because the visual input is very different, and because the observer's behavioral goals need to be taken into account. Despite the additional complexity of natural vision, fixation patterns in the context of a wide range of natural behaviors are extremely regular. Subjects behave very similarly without explicit instructions, and react to experimental manipulations in a similar manner. This means that natural behavior is unexpectedly amenable to experimental investigation. I will present evidence from gaze patterns in both real and virtual walking environments. This situation represents a challenge for the appropriate deployment of gaze because it is less predictable than a stable environment like making sandwiches or tea. An important finding is that fixation patterns are not only sensitive to immediate task demands, but also adjust very quickly to changes in the probabilistic structure of the environment that indicate different priorities for gaze allocation. It seems likely that human observers need to learn, and represent, sufficient structure about the visual environment in order to guide eye movements pro-actively, in anticipation of events that are likely to occur in the scene. I will relate these data to behavior-based models of gaze allocation (Sprague, Ballard, & Robinson, 2007).

Generalizing over regions of natural images: a theoretical model for complex cells and other non-linear receptive field properties

Michael Lewicki

Our visual system encodes complex natural edges, contours, and textures, whose retinal image is inherently highly variable. Essential to accurate perception is the formation of abstract representations that remain invariant across individual fixations. A common view is that this is achieved by neurons that signal the conjunctions of image features, but how these subserve invariant representations is poorly understood. In this talk I will discuss an approach that is based on learning statistical distributions of local regions in a visual scene. The central hypothesis is that learning these local distributions allows the visual system to generalize across similar images. I will present a model in which the joint activity of neurons encodes the probability distribution over their inputs and forms stable representations across complex patterns of variation. Trained on natural images, the model learns a compact set of functions that act as dictionary elements for image distributions typically encountered in natural scenes. Neurons in the model exhibit a wide range of properties observed in cortical neurons. These results provide a novel functional explanation for non-linear effects in complex cells in the primary visual cortex (V1) and make predictions about coding in higher visual areas, such as V2 and V4. This is joint work with Yan Karklin.

The Operating Point of the Cortex: Neurons as Large Deviation Detectors

Dario L. Ringach

Spiking neurons translate analog intracellular variables into a sequence of action potentials. A simplified model of this transformation is one in which an underlying "generator potential," representing a measure of overall neuronal drive, is passed through a static nonlinearity to produce an instantaneous firing rate. An important question is how adaptive mechanisms adjust the mean and SD of the generator potential to define an "operating point" that controls spike generation. In early sensory pathways adaptation has been shown to rescale the generator potential to maximize the amount of transmitted information. In contrast, I will discuss how in the cortex the operating point appears to be automatically tuned so that cells respond only when the generator potential executes a large excursion above its mean value. Our measurements show that the distance from the mean of the generator potential to spike threshold is, on average, 1 SD of the ongoing activity. Signals above threshold are amplified linearly and do not reach saturation. The operating point is adjusted dynamically so that it remains relatively invariant despite changes in stimulus strength. This operating regimen of the cortex is suitable for the detection of signals in background noise and for enhancing the selectivity of spike responses relative to those of the generator potential (the so-called "iceberg effect"), in non-stationary environments typical of natural scenes.
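The simplified model described above can be sketched as a threshold-linear static nonlinearity whose threshold sits 1 SD above the mean of the generator potential. This is an illustrative toy, not the analysis from the talk: the threshold is simply recomputed from the statistics of the trace as a stand-in for adaptation, and the gain and the example trace are hypothetical.

```python
import statistics


def firing_rates(generator_potential, gain=1.0, threshold_sd=1.0):
    """Map a generator-potential trace to instantaneous firing rates.

    The threshold is placed threshold_sd standard deviations above the mean
    of the trace; values above threshold are amplified linearly with no
    saturation, and values below threshold produce no spikes.
    """
    mean = statistics.mean(generator_potential)
    sd = statistics.pstdev(generator_potential)
    threshold = mean + threshold_sd * sd
    return [gain * max(0.0, v - threshold) for v in generator_potential]


# Made-up trace: small fluctuations with one large excursion at index 6.
trace = [-1.0, 0.0, 1.0, 0.0, -1.0, 0.0, 3.0, 0.0]
rates = firing_rates(trace)

# Doubling the overall signal strength rescales the threshold with it,
# so the same samples drive the cell.
scaled_rates = firing_rates([2.0 * v for v in trace])
```

In this toy trace only the large excursion crosses threshold, so the output is more selective than the underlying potential, which is the "iceberg effect" mentioned in the abstract. Because the threshold tracks the mean and SD, the set of suprathreshold samples is unchanged when the whole trace is scaled.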