Sensory Substitution and Augmentation

Fiona Macpherson

Print publication date: 2018

Print ISBN-13: 9780197266441

Published to British Academy Scholarship Online: September 2019

DOI: 10.5871/bacad/9780197266441.001.0001

PRINTED FROM BRITISH ACADEMY SCHOLARSHIP ONLINE (www.britishacademy.universitypressscholarship.com). (c) Copyright British Academy, 2020. All Rights Reserved. An individual user may print out a PDF of a single chapter of a monograph in BASO for personal use. Date: 01 June 2020

The Processing of What, Where, and How

Insights from Spatial Navigation via Sensory Substitution

9 The Processing of What, Where, and How
Sensory Substitution and Augmentation

Michael J. Proulx

David J. Brown

Achille Pasqualotto

Fiona Macpherson
British Academy

Abstract and Keywords

Vision is the default sensory modality for normal spatial navigation in humans. Touch is restricted to providing information about peripersonal space, whereas detecting and avoiding obstacles in extrapersonal space is key for efficient navigation. Hearing is restricted to the detection of objects that emit noise, yet many obstacles such as walls are silent. Sensory substitution devices provide a means of translating distal visual information into a form that visually impaired individuals can process through either touch or hearing. Here we will review findings from various sensory substitution systems for the processing of visual information that can be classified as what (object recognition), where (localization), and how (perception for action) processing. Different forms of sensory substitution excel at some tasks more than others. Spatial navigation brings together these different forms of information and provides a useful model for comparing sensory substitution systems, with important implications for rehabilitation, neuroanatomy, and theories of cognition.

Keywords:   sensory substitution, spatial navigation, object recognition, perception for action

VISION HAS MANY FORMS and behaviours. It is much more complex than first meets the eye: our visual abilities are so rapid and seemingly effortless that they belie the complex underlying mechanisms required. Fortunately we have a fantastic computational machine to solve this task for us, the brain. Sensory substitution devices aim to provide vision in many of its forms and definitions to the visually impaired. The devices do so by translating visual information into tactile or auditory information (Bach-y-Rita and Kercel 2003). What appears to be a simple question, ‘What does it mean, to see?’, is more complicated than was first considered. Marr provides a useful, functional definition as a starting point: ‘The plain man’s answer (and Aristotle’s, too) would be, to know what is where by looking’ (Marr 1982). If we unpack this, it is clear that it overlaps with the idea of two processing streams in the visual brain: ‘where’ information in a dorsal stream and ‘what’ information in a ventral stream (Mishkin and Ungerleider 1982). Modifications to these categories have been suggested, in particular that the ‘where’ aspects be considered in terms of perception versus action, and thus ‘how’ rather than just ‘where’ (Milner and Goodale 2008). Although overly simplistic (seeing for pleasure, aesthetics, and the like is missing; Ishizu and Zeki 2011), these divisions of what, where and how are useful starting points in considering how to restore vision (Proulx et al. 2016).

Spatial navigation provides an important practical and theoretical topic to consider for vision and its substitution. Navigation is a task that is made up of all aspects of vision noted above, and more (Maguire et al. 1999). For example, if you were to visit a city for the first time and needed to have your smartphone looked at, you would need to find the appropriate shop. Doing this would first require accessible information in a database, to find out the location and opening hours. Then a survey representation, such as a map, would often be assessed in some way to determine a route from a starting location to the end point at the shop. Once the general location of the shop is reached, then visual-like behaviours would be of particular use. The ‘what’ information might be the icon for the shop’s logo. Then one must localize ‘where’ it is and, with an assessment of the potential path to it, determine ‘how’ to get there. This level of spatial navigation that integrates information on multiple scales of representation is computationally diverse and challenging. However the components involved have been assessed by different sensory substitution devices, and we can build up an understanding of the strengths, and weaknesses, of different approaches to restoring vision through the other sensory modalities for navigation.

Vision might have some special properties that are challenging to convey to the other senses. A substantial body of work, much of it inspired by Treisman’s Feature Integration Theory (Treisman and Gelade 1980), has established the role of parallel processing in vision. That is, multiple features, and even multiple objects, can be processed simultaneously to a certain extent in vision. The non-visual modalities, in particular haptics, are instead often characterized by sequential or serial processing (Henriques and Soechting 2005; Hsiao et al. 2002). This contrast was made clear in an experiment that tested sighted participants by reducing the visual field with tunnel vision. This forced the serial acquisition of information and thus made visual object recognition performance equivalent to haptic object recognition (Loomis et al. 1991; Rieser et al. 1992). In a recent review, we concluded that developmental vision has a special role in conveying information in parallel (Pasqualotto and Proulx 2012; see also Brown and Proulx 2016). This ability is crucial for the multisensory integration of stimulation that occurs within the same spatial and temporal window, which is vital for perception and learning (Proulx et al. 2014). Visual perception thus provides the necessary high level of spatial acuity and throughput to bind together the multisensory experience of the environment.

In this essay we will build up to spatial navigation through a consideration of the evidence for what, where and how forms of information. We will highlight examples from the literature and in particular from our research on some of these topics. We will then return to the larger problem of spatial navigation. We will end with a review of our recent work examining the ways visual experience affects cognition, and suggest further research that builds upon knowledge of multisensory cognition in the visually impaired, and that takes advantage of the strengths of different forms of sensory substitution.

9.1 What: Image Resolution and Object Recognition

Sensory substitution research on knowing ‘what’ an object is has focused on two general issues: image resolution and object recognition. In examining the literature on object recognition and resolution from the perspective of sensory substitution, it appears that visual-to-auditory devices have an advantage. There appear to be two potential reasons for this: the nature of the sensory modality used to substitute for vision, and the capacity of the devices. First, the auditory modality is considered a higher-resolution sensory modality. Second, extant auditory devices offer a higher theoretical rate of information translation, and a higher resolution, than extant tactile devices.

Vision has the highest capacity for conveying information, even in just the phenomenological sense, well captured by the saying that a picture is worth a thousand words. A sighted person would also likely describe more attributes of a sound (such as music) than of an object only known through touch. Kokjer (1987) estimated the informational capacity of the human fingertip to be in the order of 100 bits per second (bps). The eye, by comparison, has been estimated to deliver around 4.3 × 10⁶ bps (Jacobson 1951), some four orders of magnitude greater bandwidth. The ear falls between these two limits: its capacity has been estimated at around 10⁴ bps (Jacobson 1950).
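The gap between these estimates can be checked with a short calculation. The sketch below is illustrative only: it uses the three capacity figures quoted above, and the dictionary keys and the helper name `orders_of_magnitude` are hypothetical labels introduced here, not terms from the cited studies.

```python
import math

# Capacity estimates quoted in the text, in bits per second (bps).
capacity_bps = {
    "fingertip": 100.0,  # Kokjer (1987)
    "ear": 1e4,          # Jacobson (1950)
    "eye": 4.3e6,        # Jacobson (1951)
}


def orders_of_magnitude(a, b):
    """Return how many powers of ten separate capacity a from capacity b."""
    return math.log10(a / b)


eye_vs_touch = orders_of_magnitude(capacity_bps["eye"], capacity_bps["fingertip"])
ear_vs_touch = orders_of_magnitude(capacity_bps["ear"], capacity_bps["fingertip"])
eye_vs_ear = orders_of_magnitude(capacity_bps["eye"], capacity_bps["ear"])

print(f"eye vs fingertip: {eye_vs_touch:.2f} orders of magnitude")
print(f"ear vs fingertip: {ear_vs_touch:.2f} orders of magnitude")
print(f"eye vs ear:       {eye_vs_ear:.2f} orders of magnitude")
```

On these figures the eye exceeds the fingertip by between four and five orders of magnitude, with the ear two orders above the fingertip, which is consistent with placing hearing between touch and vision.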

Do the orders of magnitude that separate the modalities in terms of information-processing capacity imply that the auditory and tactile modalities cannot possibly capture the full experience of vision? Not necessarily. Total information processing in terms of bits per second might allow much more unconscious processing in vision than in the other modalities, as seen by the capability for parallel processing in vision yet serial processing in tactile perception (Pasqualotto and Proulx 2012). This suggests that many aspects of perception for action (Proulx et al. 2016), normally unconscious in vision, might require intentional processing in another modality such as through touch. However the phenomenological experience of vision might be within the processing capabilities of the other sensory modalities. It is well known that visual information is selectively reduced at all levels of processing due to the limits of the cones and rods in the retina and onward due to the receptive field properties of neurons through the thalamus and cortex. Raichle (2010) summarizes research findings that demonstrate that only 10¹⁰ bps of the potentially unlimited (or with a very high limit) information from the environment is processed at the retina, with a further reduction to 6 × 10⁶ bps leaving the retina via a reduced number of axons in the optic nerve. Further, this is narrowed to 10⁴ bps that reaches V1. Conscious processing of visual information is further restricted to just 100 bps, or perhaps less. Therefore even the bandwidth of visual consciousness is reduced to the information-processing capacity of touch (though perhaps the bandwidth of tactile consciousness is similarly reduced).

The visual acuity, or functional resolution, possible with sensory substitution devices is an important aspect for ascertaining both ‘what’ and ‘where’ information. Visual acuity provides a measure of the distance at which two points are resolvable. Traditionally, optotypes in the form of letters or shapes are presented with decreasing size to determine acuity expressed as a Snellen fraction. The Snellen fraction is the ratio of the testing distance to the distance at which the smallest recognizable optotype subtends 5 arc-minutes, or 0.083 degrees. We noted in our recent work on visual acuity with visual-to-auditory sensory substitution (Haigh et al. 2013) that measuring visual acuity with sensory substitution must consider additional variables taken for granted in normal acuity testing, such as the field of view provided by the device. Vision with normal acuity achieved using a sensory substitution device that employs telescopic techniques would still be classified as visually impaired if restricted by severe tunnel vision (these issues are further explained here: www.seeingwithsound.com/acuity.htm). It is important to note that there are currently physical limitations on the best visual acuity possible through extant sensory substitution devices. A visual-to-auditory sensory substitution device used in our research, the vOICe (Meijer 1992), has a maximum visual acuity in the range 20/160 to 20/240 with a 60° field of view.

The first studies to assess visual acuity with sensory substitution used touch with the ‘tongue display unit’ or TDU (Chebat et al. 2007; Sampaio et al. 2001). Both studies employed the Snellen tumbling E paradigm to test participants’ performance. Sampaio et al. (2001) used a 3 cm² 12 × 12 electrotactile array and a camera with a 54° horizontal and 40° vertical field of view. The 280 × 180 pixel frames were down-sampled to the 12 × 12 tactile display resolution by averaging adjacent pixels with an image converted to black and white only (the vOICe uses grey-scale images). Because the device provided a resolution of 12 pixels horizontally, we estimated that the maximum functional acuity might be 20/2160 when calculated for the camera’s 54° field of view. Chebat et al. (2007) used a 4 cm² 10 × 10 array TDU. We estimated that the maximum theoretical acuity for a 10-pixel device such as this would be 20/1392 when calculated for the 29° field of view.
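The text does not spell out how these two estimates were derived, so the following is a hedged reconstruction rather than the authors' stated method. It assumes that the smallest usable optotype spans two device pixels and that a 20/20 optotype subtends 5 arc-minutes; the two-pixel figure is an assumption chosen because it reproduces both numbers quoted above. The function name `snellen_denominator` and its parameters are hypothetical.

```python
def snellen_denominator(fov_degrees, pixels_across, optotype_pixels=2):
    """Estimate N in a Snellen fraction 20/N for a pixel-based display.

    Assumptions (not stated in the source text):
    - the smallest usable optotype spans `optotype_pixels` device pixels;
    - a 20/20 optotype subtends 5 arc-minutes.
    """
    arcmin_per_pixel = fov_degrees * 60.0 / pixels_across
    optotype_arcmin = optotype_pixels * arcmin_per_pixel
    # Scale relative to the 5 arc-minute optotype that defines 20/20.
    return 20.0 * optotype_arcmin / 5.0


# 12-pixel-wide array with a 54 degree horizontal field of view.
tdu_12 = snellen_denominator(54, 12)
# 10-pixel-wide array with a 29 degree field of view.
tdu_10 = snellen_denominator(29, 10)

print(f"12 x 12 array: 20/{tdu_12:.0f}")
print(f"10 x 10 array: 20/{tdu_10:.0f}")
```

Under these assumptions the estimate depends only on the angular size of a pixel, so a narrower field of view improves nominal acuity at the cost of tunnel vision, the trade-off raised earlier in this section.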

More recent studies have found higher levels of acuity using hearing as the substituting modality. Acuity using the vOICe was reported by Striem-Amit et al. (2012) and by Haigh et al. (2013). The study by Haigh et al. (2013) in particular found that very little training or explanation was required for participants to reach high levels of visual acuity. The final acuity category of 20/408 in this study represented the upper limit of performance for all participants; this corresponded to optotypes reduced to 5 pixels in width. This limitation was perhaps due to the physical representation of the optotype provided being a sub-sampling of the original image. This suggested that participants were able to discriminate the orientation of the optotype with two vertical pixels representing the letter E when a minimum of five should be necessary for accurate discrimination. The expert blind participants in the Striem-Amit et al. (2012) study may have performed even better (in the range of 20/600 to 20/200) due to multisensory perceptual learning processes (Proulx et al. 2014). Also, all of their participants were blind, all but one of them congenitally, and this may have been a factor in their high performance (Chebat et al. 2007).

So although parallels between the auditory and visual systems are not obvious in the way that the skin/retina analogue is (which was the inspiration for Bach-y-Rita’s original tactile sensory substitution device), the ear has the potential to provide a higher-throughput means of directing visual information to the brain than the skin. Moreover, even though the visual system might have the greatest information-processing capacity and spatial acuity, the auditory system is also a high-resolution sensory modality in the temporal sense. An experiment that used temporal order judgements to assess the temporal acuity of the senses found thresholds of 141 ms for tactile stimulation, 21.5 ms for auditory stimulation, and 29 ms for visual stimulation (Laasonen et al. 2001). Thus the auditory system excels at temporal processing, and a system that draws on this capacity for the translation of visuospatial information might be best placed to provide high-resolution sensory substitution (however, see also Brown et al. 2015). Indeed the superior visual acuity performance found with the vOICe might be due not only to the number of pixels that can be translated by the device but also to the higher information-processing capacity of hearing versus touch (Brown et al. 2014).

Ultimately, a decent level of acuity will allow one to recognize objects with some level of accuracy and, ideally, speed. Perhaps not coincidentally, most research on the recognition of natural objects with sensory substitution has focused on hearing as the substituting sense. For example, Auvray et al. (2007) found that participants using the vOICe were able to discriminate among natural, three-dimensional objects belonging to the same category and identify object categories as well (see also Pasqualotto and Esenkaya 2016). Pollok et al. (2005), with the same participants as Proulx et al. (2008), found that training with three-dimensional objects at home and in the lab generalized to two-dimensional object recognition presented via images sonified with the vOICe. Another visual-to-auditory device, the PSVA, has been used to demonstrate the recognition of visual patterns, though not the complex objects used in the other work described here (Arno et al. 1999). Tactile-based substitution has not been used for object recognition beyond that of the pattern recognition necessary for testing visual acuity. More recent work has examined the learning and transfer of novel, natural, three-dimensional objects in greater detail. For example, Stoerig and Proulx (2008) required participants learning to use the vOICe to specifically identify objects seen through sound, including details to verify object recognition. All participants exceeded chance levels within the first few hours of training, reaching an average of over 60 per cent correct after 15 hours of training, with some reaching perfect performance. Of course the experimental conditions were much simpler than normal natural scenes, and reducing the amount of clutter that is normally presented to the eye allowed the focus to be on purely identifying the objects rather than having to also perform scene segregation (see also Brown and Proulx 2016).

9.2 Where and How: Localization for Action

The superior resolution currently available with visual-to-auditory sensory substitution might also be of use for fine-grained spatial localization. However the method of sonifying spatial locations and the use of active motion might also be crucial. Because of the potential applications of sensory substitution in the daily lives of visually impaired persons, many tests require active localization via pointing or grasping, and this work therefore bears on both ‘where’ and ‘how’ forms of information. The need not only to perceive objects but also to act in response to them is a key component of this work, consistent with the perspective of Milner and Goodale (2008) on the primacy of perception for action, rather than perception alone (Proulx et al. 2016).

Auditory localization of objects has been extensively studied in human and nonhuman participants (Konishi 2000). The type of localization task that is crucial for sensory substitution is distinct from normal auditory localization, even if hearing is the substituting modality. Instead of relying upon a distal source of sound (such as a person’s footsteps), a user must rely upon the conversion system generating sound representations of silent, distal and immobile objects. Lateral auditory localization normally depends on two input properties known as interaural level difference (ILD) and interaural time difference (ITD); for a review see Moore (1982). In general, the human system for horizontal auditory localization can take advantage of both of these types of information, or each one separately if they are not both present (Middlebrooks and Green 1991). Performance is best when all forms of information are present, providing a rich signal to localize a sound source. A behavioural cost in terms of response time and accuracy results when only one type of information is supplied (Schroger 1996), and a corresponding cost has also been observed with neurophysiological methods (Palomaki et al. 2005). Vertical (elevation) auditory localization primarily depends on the spectrum of the sound cue, as created by the shape of the pinna, or outer ear (Middlebrooks and Green 1991). The pinna amplifies some frequencies and weakens others, resulting in a specific spectral tuning that is unique for each person (Hofman et al. 1998).

The perceptual learning of sound localization in humans has been examined with other objectives and methods. For example, Hofman et al. (1998) modified the outer ears of human subjects with a mould, and examined the sound localization of the subjects over time. They found that their subjects could indeed learn to hear with their newly-shaped ears, but proposed that this was only because of the richness of the auditory cues, perhaps consistent with multisensory perceptual learning (Proulx et al. 2014), and the use of the new ears in daily life. Like Hofman and colleagues, Proulx et al. (2008) provided daily experience with a novel auditory input to participants, although for less time than in Hofman’s study (10 or 21 days for Proulx et al. versus 39 days in Hofman et al.’s study). In further contrast, the participants had two artificial signals provided by the vOICe for object position: laterality was coded by stereo panning and the time provided by the left-to-right scanning transformation of each image, and elevation by frequency, so that up was represented by high frequencies and down by low frequencies. Pixel brightness was coded by loudness. Images were taken by a small video camera hidden in sunglasses and converted into the sound patterns that the subjects heard through stereo headphones. The object’s horizontal location in the image is coded by time and stereo panning, so that the sound pertaining to an object on the right of the image will be heard late in the scan and predominantly through the right ear, using both ILD and ITD. Whether it is located higher up or lower down in the image is expressed by frequency, so that an object in the upper part of the image is heard in higher tones than one in the lower part. Together this provided the ‘where’ information.
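The conversion principle described above (horizontal position to scan time and stereo panning, elevation to frequency, brightness to loudness) can be illustrated with a minimal sketch. This is not the vOICe implementation: the one-second scan, the 16 kHz sample rate, the 500 to 5000 Hz range, the linear frequency spacing, and the use of level panning alone (with no interaural time difference) are all simplifying assumptions, and the function names are hypothetical.

```python
import math


def row_to_frequency(row, n_rows, f_low=500.0, f_high=5000.0):
    """Map image row to tone frequency: top row -> f_high, bottom row -> f_low.

    Linear spacing is an assumption made here for simplicity.
    """
    frac = 1.0 - row / (n_rows - 1) if n_rows > 1 else 0.5
    return f_low + frac * (f_high - f_low)


def sonify(image, scan_seconds=1.0, sample_rate=16000):
    """Convert a grey-scale image into a list of (left, right) audio samples.

    `image` is a list of rows (row 0 is the top); each value is a brightness
    between 0 (black, silent) and 1 (white, loudest).
    """
    n_rows, n_cols = len(image), len(image[0])
    samples_per_col = int(scan_seconds * sample_rate / n_cols)
    out = []
    for col in range(n_cols):  # left-to-right scan: column index sets timing
        pan = col / (n_cols - 1) if n_cols > 1 else 0.5  # 0 = left, 1 = right
        for s in range(samples_per_col):
            t = (col * samples_per_col + s) / sample_rate
            value = 0.0
            for row in range(n_rows):
                brightness = image[row][col]
                if brightness == 0.0:
                    continue  # dark pixels contribute no sound
                freq = row_to_frequency(row, n_rows)
                value += brightness * math.sin(2.0 * math.pi * freq * t)
            # Level panning only: an object on the right is louder in the right ear.
            out.append(((1.0 - pan) * value, pan * value))
    return out


# A 4 x 4 image containing one bright pixel in the top-right corner.
image = [
    [0, 0, 0, 1.0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
samples = sonify(image)
```

In this example the first three quarters of the scan are silent and the final quarter carries a high tone in the right channel only, matching the description of an object in the upper right of the image being heard late, in the right ear, and at a high pitch.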

The first experiment by Proulx et al. (2008) assessed localization with a perimetry device that was constructed with an array of LEDs for a manual search task. One LED would light up and would be accompanied by a tone until pressed, when the light and tone were extinguished. Over a three-week period, those participants using the vOICe in daily practice with natural objects in natural environments (their own homes) were able to generalize that experience to the lab test with significant improvements in speed and accuracy. A second experiment examined the localization and grasping of natural objects placed on a large table (Auvray et al. 2007). Again we found successful transfer of experience in the home to the lab, where those trained to use the vOICe had significant improvement in not only locating the objects, but reaching with grasp-appropriate hand configurations. This suggested that they not only understood ‘where’ the objects were, but had access to features related to ‘what’ the objects were, too: size, shape, and orientation.

More recent work by Brown et al. (2011) found that the location of the camera providing visual input interacted with the goal of the task. For example, while Proulx et al. (2008) used a head-mounted camera to mimic eyesight, Auvray et al. (2007) used a handheld camera for their tasks. Brown and colleagues compared the performance for tasks requiring either object identification or localization with both camera positions. They reported an interesting dissociation: object identification was better with the handheld camera and localization was better with the head-mounted camera. This suggests that the ability to sample many viewpoints of the object via the hand is particularly useful for identifying it, and also that mimicking the perceptual-motor contingencies used in normal localization can improve performance as well, with the viewpoint near the eyes.

The ability to correct fast reaching movements has been assessed with a device that is similar to the vOICe in its basic structure, except with the addition of timbre as a feature to code for colour in EyeMusic (Levy-Tzedek et al. 2012). The tasks required rapid reaching to a target, and with little training participants were able to perform the task nearly as accurately with the device as with seeing. One problem for interpreting the results, however, is that the amount of time allowed to listen to the object location was, potentially, unlimited before the fast reaching movement was required. That period of time is crucial for real-world navigational needs, and for making crucial comparisons with vision. If visual perception is thought of as ‘unconscious inference’ in the ways proposed by Von Helmholtz (2005), with some requirement for rapid processing that would go beyond conscious, effortful inference, then the current state of sensory substitution would require extensive training and experience before that threshold is reached. This and a number of visual-to-tactile studies often report either accuracy or time as the primary measure of performance; it is challenging to benchmark the relative costs and benefits of different devices and sensory modalities in the absence of all the necessary data to consider their use.

Another visual-to-auditory sensory substitution device has been used for studies of localization: the PSVA or prosthesis for substitution of vision by audition (Capelle et al. 1998). Unlike the vOICe, which sweeps the image from left to right to create the sonification, the PSVA provides a simultaneous sonification of the entire image and thus requires manual movement (either by the participant or the camera) to make sense of the image, similar to the concept of using eye movements to perceive a scene. While the studies described above that employed the vOICe implicitly required the perception of depth for the accurate localization and grasping of objects, a study with the PSVA explicitly examined the problem of depth. Both devices use a single camera, thus depth must be inferred from monocular cues rather than stereopsis. Renier et al. (2005) examined the ability of participants to locate items in depth using cues similar to those present in a natural corridor (and consequently the same cues that can create compelling corridor illusions). Although reaching and grasping were not employed, the participants were able to perceive and report the depth relations between the objects in the display.

The qualitative debriefings reported by Renier et al. (2005) provide interesting observations about the learning process, and how depth was understood by the participants. Those participants who were blind from birth or in early childhood had to first learn the basics of vision that a sighted person can take for granted: shape, size constancy, depth information, occlusion, perspective, and other ways in which the brain has learned, through development, to ‘correct’ for distortions in the two-dimensional image that are simply due to the viewing perspective rather than revealing changes in the object itself (Proulx and Harder 2008). It was noteworthy that all early blind participants were aware that relative size was a depth cue on the basis of hearing how sighted persons described the world. This is crucial because the acquisition of object information through haptics is size-invariant of course; a far-off object is out of reach, and any object that can be touched is perceived at its unchanging three-dimensional size. Similar findings have been reported for visual-to-tactile sensory substitution (Segond et al. 2013). Even with this top-down knowledge of how distance impacts the perception of size, the early blind participants still had more difficulty correcting for that than those who were sighted (but blindfolded) and those who became blind later in life. Thus learning the vocabulary of vision through sensory substitution (Deroy and Auvray 2012) might require not only active experience (Proulx 2010), but perhaps visual experience as well for full functioning (Pasqualotto and Proulx 2012).

9.3 Spatial Navigation as the Processing of What, Where and How

Given the superiority of auditory perception in the domain of temporal processing, most studies have focused on the performance of visual-to-auditory sensory substitution for representing object information with high fidelity and acuity. However one study has also examined the use of the vOICe for spatial navigation. One challenge for using the vOICe for navigation is that such a task requires online perception and correction to avoid stationary obstructions, let alone moving ones. The vOICe only samples the environment with an image every 1–2 seconds (even though it can be set at much higher refresh rates, no studies have examined the ability of users to function at those sampling rates). Brown et al. (2011) had participants walk a short route that contained four obstacles using the vOICe. It took them over five minutes to complete the task, though they demonstrated improvement of over one minute with eight trials of practice. Certainly real-world navigation would introduce challenges that would make it practically unfeasible in its current form. However navigation in vision takes advantage of peripheral vision with its lower resolution; perhaps a simpler representation is better for such a task. One auditory example might be the simplified auditory input that can be provided by learned echolocation in some blind humans (Thaler et al. 2011).

Representing spatial information in an online manner might instead provide a task for which visual-to-tactile sensory substitution would excel. Devices like the TDU provide constant stimulation from the environment and as a result require movement for accurate localization and identification of objects (Matteau et al. 2010). As noted previously, the resolution of such devices is lower than that of an auditory device like the vOICe. This might not be a drawback for a task like navigation. Normal obstacle avoidance utilizes peripheral vision, which provides lower-resolution information compared to the fovea due to a combination of having fewer cone cells and the cortical magnification factor that favours foveal vision (Loschky et al. 2005). Peripheral vision is also a primary contributor to magnocellular processing which, like the representation provided by the TDU, is selective for contrast (the TDU only has black and white representation) and motion (Livingstone and Hubel 1988).

The methods of assessing spatial navigation with tactile vision substitution systems have been creative. Segond et al. (2005) created a semi-passive paradigm where the participants using the TDU remained still, but operated a camera-carrying robot with a remote control. The robot was placed in a three-dimensional maze. The accuracy and speed of maze completion were assessed, and participants succeeded from the very first attempt. Again it is challenging to generalize from this experience directly to real-world navigation. A more recent study with the TDU (Chebat et al. 2011) used a real course in a corridor with obstacles, similar to that tested with the vOICe (Brown et al. 2011). Not only were participants able to successfully navigate the course with few errors, but it was also found that congenitally blind individuals were able to out-perform sighted participants. This is interesting because it further suggests that route or egocentric spatial knowledge is preserved even in the absence of visual experience (Pasqualotto and Proulx 2012; Pasqualotto et al. 2013b), although allocentric spatial knowledge might be affected. Although the participants were successful in terms of accuracy, this study, like others in the literature, did not provide as much information about the time required to complete the task, and thus it is difficult to fully assess the use of the TDU and tactile perception rather than the vOICe and auditory perception for such a task.

Real-world navigation requires a combination of cognitive skills such as object recognition (what), localization (where) and obstacle avoidance (how). Furthermore it would be beneficial to have access to allocentric, map-like representations and perhaps even some knowledge of cardinal directions, such as that provided by sensory augmentation devices like vibrotactile belts (Karcher et al. 2012; Nagel et al. 2005). One solution that has yet to be tested is the use of more than one device at once. For example, integrating a smartphone for address and route information, a belt for cardinal directions, a tactile device for obstacle avoidance, and an auditory device for high-resolution shape recognition and localization might be ideal. In fact, this might best represent how human spatial navigation normally works; yet the challenge to monitor and incorporate input from such a plethora of devices might be beyond current human capacities.

9.4 Cross-Modal Correspondences

The basic idea of creating a sensory substitution device is predicated on the utility of translating information normally sensed by one modality into a format that can be sensed by another modality. Although humans might be able to learn any arbitrary association between two sensory modalities, it would seem sensible to base the conversion principles upon any natural cross-modal correspondences that humans have. There is increasing evidence that a task that targets one sense can be affected by the stimulation of the other senses, even if such stimulation is task-irrelevant and below the threshold of awareness (Evans and Treisman 2010; Mondloch and Maurer 2004). The vOICe takes its basic conversion principle from experiments that have shown that the perception of high or low frequency (pitch) affects lexical decision tasks, such as deciding whether one has read the word ‘down’ at the same time as hearing a tone of low frequency (Walker and Smith 1984). Thus stating that an auditory pitch is high or low is not just metaphorical, but directly affects one’s perception of the world. Similarly, auditory and vibrotactile stimuli are known to share similar temporal patterns, and research has shown that this similarity can also lead to perceptual bias (Schurmann et al. 2004). Moreover, developmental research shows that cross-modal interaction occurs from an early age (Mondloch and Maurer 2004).
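
To make the idea of a conversion principle concrete, the following is a minimal sketch of a pitch-for-height mapping in the spirit of the vOICe (Meijer 1992): an image is scanned from left to right, each row is assigned a tone whose frequency rises with vertical position, and pixel brightness sets the loudness of that tone. The function names, the 500–5000 Hz range, the one-second scan, and the sample rate are illustrative assumptions rather than the parameters of any particular device.

```python
import math


def row_to_frequency(row, n_rows, f_low=500.0, f_high=5000.0):
    """Map an image row (0 = top) to a tone frequency in Hz.

    Rows nearer the top of the image get higher frequencies.
    The frequency range and exponential spacing are assumptions.
    """
    if n_rows == 1:
        return f_low
    height = (n_rows - 1 - row) / (n_rows - 1)  # 0.0 at bottom, 1.0 at top
    return f_low * (f_high / f_low) ** height


def sonify(image, duration=1.0, sample_rate=16000):
    """Turn a greyscale image (rows of brightness values in [0, 1])
    into a list of audio samples.

    Columns are played in order from left to right; within each column,
    every row contributes a sine tone weighted by its brightness.
    """
    n_rows = len(image)
    n_cols = len(image[0])
    samples_per_col = int(duration * sample_rate / n_cols)
    freqs = [row_to_frequency(r, n_rows) for r in range(n_rows)]
    samples = []
    for c in range(n_cols):
        for s in range(samples_per_col):
            t = (c * samples_per_col + s) / sample_rate
            value = sum(
                image[r][c] * math.sin(2 * math.pi * freqs[r] * t)
                for r in range(n_rows)
            )
            samples.append(value / n_rows)  # keep amplitude within [-1, 1]
    return samples
```

Under these assumptions, a bright pixel in the top-left corner of the image would be heard as a high tone at the start of the scan, and a bright pixel in the bottom-right corner as a low tone at the end. A mapping that ignored the pitch-height correspondence, for example by assigning frequencies to rows at random, could be produced by changing only `row_to_frequency`.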

As useful as such studies might be for designing a sensory substitution device, they all have one drawback: each used sighted participants exclusively. Thus, the use of such mappings assumes, as some have proposed, that these cross-modal correspondences are hard-wired or are otherwise based on experience that does not require visual perception (Spence and Deroy 2012). Whether this assumption is correct is currently unknown (but see Hamilton-Fletcher et al. 2018), and thus it may be necessary to establish whether the mappings reported in the cross-modal interaction literature apply to individuals with congenital visual impairment. In fact, evidence from research on accessible technology suggests that some visually impaired users have inverse polarity for mapping data values, such as temperature, to acoustic parameters, such as pitch (Walker and Mauney 2001).

We have found other differences between congenitally blind individuals on the one hand, and those with visual experience, such as the late blind and sighted, on the other. In particular, it appears that visual experience is necessary for the normal neural development that underlies spatial cognition (Pasqualotto and Proulx 2012). For example, sighted and late blind participants prefer an allocentric representation to remember the spatial locations of aligned objects in extrapersonal space (outside of arm’s reach), even if such a representation requires a form of mental rotation to recode the locations of the objects in relation to one another. Congenitally blind individuals, though, prefer an egocentric representation based on the starting location used for acquiring knowledge of the location of the objects (Pasqualotto et al. 2013b). Visual experience also impacts semantic processing in memory (Amedi et al. 2003; Bedny et al. 2011; Pasqualotto et al. 2013a). Given the role of visual experience for cognitive processing in the sighted, there is certainly a possibility that the cross-modal mappings used by extant sensory substitution devices could be improved, or tailored for users with different amounts of visual experience. Of course, the brain plasticity observed in many studies suggests that almost any arbitrary algorithm can be learned; however, perhaps that arbitrariness is what makes learning so challenging and unappealing to potential users.

9.5 Opportunities of Sensory Substitution

There are many dimensions of the psychology of sensory substitution that are open to active investigation. The current limits on the functionality of devices arise from multiple domains: technological, such as device resolution limits; modality, such as the resolution or nature of the sensory system substituting for vision; mapping algorithm, based upon cross-modal correspondences; and learning and plasticity, such as the optimal training required for multisensory perceptual learning (Proulx et al. 2014) and generalization (Brown and Proulx 2013). There are certainly reasons to be optimistic for the future of sensory substitution. First, naïve users are able to perform not only above chance with minimal training, but even at near-ceiling levels of visual acuity; moreover, even a little training improves performance (Haigh et al. 2013), and that improvement can be maintained over several months and generalized beyond what was specifically practised during training (Brown and Proulx 2013). These points are crucial and should be considered in the context of the development of normal vision; certainly human infants do not learn to see as adults in merely one day. The current state of the art also suggests that different devices and modalities might be advantageous for different tasks, though it is unclear at the present time whether this is due to the nature of the devices or the substituting modalities.

There are also many other unanswered questions of central importance for both basic and applied research in psychology, neuroscience, and philosophy. We have noted elsewhere that active sensorimotor expertise and experience appear to be necessary for full mastery of a sensory substitution device, and to a much lesser extent for the experience of visual qualia through these means (Proulx 2010; Proulx and Stoerig 2006). It is therefore an open question whether something like visual qualia is possible in a congenitally blind user who has not had the visual experience necessary to store visual images that can be linked with the device output. This brings up the definition of vision once again. Is vision defined by what one does with it, or the experience of it? A proficient user of a sensory substitution device might appear to see from the perspective of an independent observer, even though only some users might have visual qualia evoked by the experience. On the applied side, creating a multimodal device that incorporates the best of the extant devices might be of interest, perhaps combining an auditory device for high-resolution object recognition and a tactile device for navigation. Such a device would require a greater understanding of the attentional demands that would be imposed, not only in attending to the two devices at once, but to the ambient environment as well. Although current sensory substitution devices appear to be limited in being adopted universally for use, the potential for using sensory substitution as a tool to better understand basic cognition is clear. Such fundamental investigations should in turn provide opportunities to improve the viability of sensory substitution in the future and overcome the present limitations.


Bibliography references:

Amedi, A., Raz, N., Pianka, P., Malach, R. and Zohary, E. (2003) Early ‘visual’ cortex activation correlates with superior verbal memory performance in the blind. Nature Neuroscience 6(7): 758–766. doi: 10.1038/nn1072.

Arno, P., Capelle, C., Wanet-Defalque, M. C., Catalan-Ahumada, M. and Veraart, C. (1999) Auditory coding of visual patterns for the blind. Perception 28(8): 1013–1029.

Auvray, M., Hanneton, S. and O’Regan, J. K. (2007) Learning to perceive with a visuo-auditory substitution system: localisation and object recognition with ‘the vOICe’. Perception 36(3): 416–430.

Bach-y-Rita, P. and Kercel, S. W. (2003) Sensory substitution and the human-machine interface. Trends in Cognitive Sciences 7(12): 541–546. doi: S1364661303002900 [pii].

Bedny, M., Pascual-Leone, A., Dodell-Feder, D., Fedorenko, E. and Saxe, R. (2011) Language processing in the occipital cortex of congenitally blind adults. Proceedings of the National Academy of Sciences 108(11): 4429–4434.

Brown, D., Macpherson, T. and Ward, J. (2011) Seeing with sound? Exploring different characteristics of a visual-to-auditory sensory substitution device. Perception 40(9): 1120–1135.

Brown, D. J. and Proulx, M. J. (2013) Increased signal complexity improves the breadth of generalization in auditory perceptual learning. Neural Plasticity 9. doi: 10.1155/2013/879047.

Brown, D. J. and Proulx, M. J. (2016) Audio-vision substitution for blind individuals: addressing human information processing capacity limitations. IEEE Journal of Selected Topics in Signal Processing 10: 924–931.

Brown, D. J., Simpson, A. J. R. and Proulx, M. J. (2014) Visual objects in the auditory system via sensory substitution: how much information do we need? Multisensory Research 27: 337–357. doi: 10.1163/22134808-00002462.

Brown, D. J., Simpson, A. J. R. and Proulx, M. J. (2015) Auditory Scene Analysis and sonified visual images: does consonance negatively impact on object formation when using complex sonified stimuli? Frontiers in Psychology 6. doi: 10.3389/fpsyg.2015.01522.

Capelle, C., Trullemans, C., Arno, P. and Veraart, C. (1998) A real-time experimental prototype for enhancement of vision rehabilitation using auditory substitution. IEEE Transactions on Biomedical Engineering 45(10): 1279–1293. doi: 10.1109/10.720206.

Chebat, D. R., Rainville, C., Kupers, R. and Ptito, M. (2007) Tactile-‘visual’ acuity of the tongue in early blind individuals. Neuroreport 18(18): 1901–1904. doi: 10.1097/WNR.0b013e3282f2a63.

Chebat, D. R., Schneider, F. C., Kupers, R. and Ptito, M. (2011) Navigation with a sensory substitution device in congenitally blind individuals. Neuroreport 22(7): 342–347. doi: 10.1097/WNR.0b013e3283462def.

Deroy, O. and Auvray, M. (2012) Reading the world through the skin and ears: a new perspective on sensory substitution. Frontiers in Psychology 3: 457. doi: 10.3389/fpsyg.2012.00457.

Evans, K. K. and Treisman, A. (2010) Natural cross-modal mappings between visual and auditory features. Journal of Vision 10(1). doi: 10.1167/10.1.6.

Haigh, A., Brown, D. J., Meijer, P. and Proulx, M. J. (2013) How well do you see what you hear? The acuity of visual-to-auditory sensory substitution. Frontiers in Psychology 4. doi: 10.3389/fpsyg.2013.00330.

Hamilton-Fletcher, G., Pisanski, K., Reby, D., Stefańczyk, M., Ward, J. and Sorokowska, A. (2018) The role of visual experience in the emergence of cross-modal correspondences. Cognition 175: 114–121.

Henriques, D. Y. and Soechting, J. F. (2005) Approaches to the study of haptic sensing. Journal of Neurophysiology 93(6): 3036–3043. doi: 10.1152/jn.00010.2005.

Hofman, P. M., Van Riswick, J. G. and Van Opstal, A. J. (1998) Relearning sound localization with new ears. Nature Neuroscience 1(5): 417–421. doi: 10.1038/1633.

Hsiao, S. S., Lane, J. and Fitzgerald, P. (2002) Representation of orientation in the somatosensory system. Behavioural Brain Research 135(1–2): 93–103.

Ishizu, T. and Zeki, S. (2011) Toward a brain-based theory of beauty. PLoS One 6(7): e21852. doi: 10.1371/journal.pone.0021852.

Jacobson, H. (1950) The informational capacity of the human ear. Science 112(2901): 143–144.

Jacobson, H. (1951) The informational capacity of the human eye. Science 113(2933): 292–293.

Karcher, S. M., Fenzlaff, S., Hartmann, D., Nagel, S. K. and Konig, P. (2012) Sensory augmentation for the blind. Frontiers in Human Neuroscience 6: 37. doi: 10.3389/fnhum.2012.00037.

Kokjer, K. J. (1987) The information capacity of the human fingertip. IEEE Transactions on Systems, Man and Cybernetics 17(1): 100–102. doi: 10.1109/TSMC.1987.289337.

Konishi, M. (2000) Study of sound localization by owls and its relevance to humans. Comparative Biochemistry and Physiology Part A: Molecular and Integrative Physiology 126(4): 459–469. doi: 10.1016/S1095-6433(00)00232-4.

Laasonen, M., Service, E. and Virsu, V. (2001) Temporal order and processing acuity of visual, auditory, and tactile perception in developmentally dyslexic young adults. Cognitive, Affective, and Behavioral Neuroscience 1(4): 394–410.

Levy-Tzedek, S., Hanassy, S., Abboud, S., Maidenbaum, S. and Amedi, A. (2012) Fast, accurate reaching movements with a visual-to-auditory sensory substitution device. Restorative Neurology and Neuroscience 30(4): 313–323. doi: 10.3233/RNN-2012-110219.

Livingstone, M. and Hubel, D. (1988) Segregation of form, color, movement, and depth: anatomy, physiology, and perception. Science 240(4853): 740–749. doi: 10.1126/science.3283936.

Loomis, J. M., Klatzky, R. L. and Lederman, S. J. (1991) Similarity of tactual and visual picture recognition with limited field of view. Perception 20(2): 167–177.

Loschky, L., McConkie, G., Yang, J. and Miller, M. (2005) The limits of visual resolution in natural scene viewing. Visual Cognition 12(6): 1057–1092. doi: 10.1080/13506280444000652.

Maguire, E. A., Burgess, N. and O’Keefe, J. (1999) Human spatial navigation: cognitive maps, sexual dimorphism, and neural substrates. Current Opinion in Neurobiology 9(2): 171–177.

Meijer, P. B. L. (1992) An experimental system for auditory image representations. IEEE Transactions on Biomedical Engineering 39: 112–121.

Marr, D. (1982) Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. San Francisco: W. H. Freeman.

Matteau, I., Kupers, R., Ricciardi, E., Pietrini, P. and Ptito, M. (2010) Beyond visual, aural and haptic movement perception: hMT+ is activated by electrotactile motion stimulation of the tongue in sighted and in congenitally blind individuals. Brain Research Bulletin 82(5–6): 264–270. doi: 10.1016/j.brainresbull.2010.05.001.

Middlebrooks, J. C. and Green, D. M. (1991) Sound localization by human listeners. Annual Review of Psychology 42: 135–159. doi: 10.1146/annurev.ps.42.020191.001031.

Milner, A. D. and Goodale, M. A. (2008) Two visual systems re-viewed. Neuropsychologia 46(3): 774–785. doi: 10.1016/j.neuropsychologia.2007.10.005.

Mishkin, M. and Ungerleider, L. G. (1982) Contribution of striate inputs to the visuospatial functions of parieto-preoccipital cortex in monkeys. Behavioural Brain Research 6(1): 57–77.

Mondloch, C. J. and Maurer, D. (2004) Do small white balls squeak? Pitch-object correspondences in young children. Cognitive, Affective, and Behavioral Neuroscience 4(2): 133–136.

Moore, B. C. J. (1982) An Introduction to the Psychology of Hearing. London: Academic Press.

Nagel, S. K., Carl, C., Kringe, T., Martin, R. and Konig, P. (2005) Beyond sensory substitution—learning the sixth sense. Journal of Neural Engineering 2(4): R13–26. doi: 10.1088/1741-2560/2/4/R02.

Palomaki, K. J., Tiitinen, H., Makinen, V., May, P. J. and Alku, P. (2005) Spatial processing in human auditory cortex: the effects of 3D, ITD, and ILD stimulation techniques. Brain Research. Cognitive Brain Research 24(3): 364–379. doi: 10.1016/j.cogbrainres.2005.02.013.

Pasqualotto, A. and Esenkaya, T. (2016) Sensory substitution: the spatial updating of auditory scenes ‘mimics’ the spatial updating of visual scenes. Frontiers in Behavioral Neuroscience 10: 79.

Pasqualotto, A. and Proulx, M. J. (2012) The role of visual experience for the neural basis of spatial cognition. Neuroscience and Biobehavioral Reviews 36(4): 1179–1187. doi: 10.1016/j.neubiorev.2012.01.008.

Pasqualotto, A., Lam, J. S. and Proulx, M. J. (2013a) Congenital blindness improves semantic and episodic memory. Behavioural Brain Research 244: 162–165. doi: 10.1016/j.bbr.2013.02.005.

Pasqualotto, A., Spiller, M. J., Jansari, A. S. and Proulx, M. J. (2013b) Visual experience facilitates allocentric spatial representation. Behavioural Brain Research 236(1): 175–179. doi: 10.1016/j.bbr.2012.08.042.

Pollok, B., Schnitzler, I., Stoerig, P., Mierdorf, T. and Schnitzler, A. (2005) Image-to-sound conversion: experience-induced plasticity in auditory cortex of blindfolded adults. Experimental Brain Research 167(2): 287–291. doi: 10.1007/s00221-005-0060-8.

Proulx, M. J. (2010) Synthetic synaesthesia and sensory substitution. Consciousness and Cognition 19(1): 501–503. doi: 10.1016/j.concog.2009.12.005.

Proulx, M. J. and Harder, A. (2008) Sensory substitution. Visual-to-auditory sensory substitution devices for the blind. Dutch Journal of Ergonomics/Tijdschrift voor Ergonomie 33: 20–22.

Proulx, M. J. and Stoerig, P. (2006) Seeing sounds and tingling tongues: qualia in synaesthesia and sensory substitution. Anthropology and Philosophy 7: 135–151.

Proulx, M. J., Stoerig, P., Ludowig, E. and Knoll, I. (2008) Seeing ‘where’ through the ears: effects of learning-by-doing and long-term sensory deprivation on localization based on image-to-sound substitution. PLoS One 3(3): e1840. doi: 10.1371/journal.pone.0001840.

Proulx, M. J., Brown, D. J., Pasqualotto, A. and Meijer, P. (2012) Multisensory perceptual learning and sensory substitution. Neuroscience and Biobehavioral Reviews 41: 16–25. doi: 10.1016/j.neubiorev.2012.11.017.

Proulx, M. J., Brown, D., Pasqualotto, A. and Meijer, P. (2014) Multisensory perceptual learning and sensory substitution. Neuroscience and Biobehavioral Reviews 41: 16–25. doi: 10.1016/j.neubiorev.2012.11.017.

Proulx, M. J., Gwinnutt, J., Dell’Erba, S., Levy-Tzedek, S., de Sousa, A. A. and Brown, D. J. (2016) Other ways of seeing: from behavior to neural mechanisms in the online ‘visual’ control of action with sensory substitution. Restorative Neurology and Neuroscience 34: 29–44.

Raichle, M. E. (2010) Two views of brain function. Trends in Cognitive Sciences 14(4): 180–190. doi: 10.1016/j.tics.2010.01.008.

Renier, L., Collignon, O., Poirier, C., Tranduy, D., Vanlierde, A., Bol, A., et al. (2005) Cross-modal activation of visual cortex during depth perception using auditory substitution of vision. NeuroImage 26(2): 573–580. doi: 10.1016/j.neuroimage.2005.01.047.

Rieser, J. J., Hill, E. W., Talor, C. R., Bradfield, A. and Rosen, S. (1992) Visual experience, visual field size, and the development of nonvisual sensitivity to the spatial structure of outdoor neighborhoods explored by walking. Journal of Experimental Psychology: General 121(2): 210–221.

Sampaio, E., Maris, S. and Bach-y-Rita, P. (2001) Brain plasticity: ‘visual’ acuity of blind persons via the tongue. Brain Research 908(2): 204–207. doi: S0006-8993(01)02667-1 [pii].

Schroger, E. (1996) Interaural time and level differences: integrated or separated processing? Hearing Research 96(1–2): 191–198.

Schurmann, M., Caetano, G., Jousmaki, V. and Hari, R. (2004) Hands help hearing: facilitatory audiotactile interaction at low sound-intensity levels. The Journal of the Acoustical Society of America 115(2): 830–832.

Segond, H., Weiss, D. and Sampaio, E. (2005) Human spatial navigation via a visuo-tactile sensory substitution system. Perception 34(10): 1231–1249.

Segond, H., Weiss, D., Kawalec, M. and Sampaio, E. (2013) Perceiving space and optical cues via a visuo-tactile sensory substitution system: a methodological approach for training of blind subjects for navigation. Perception 42(5): 508–528.

Spence, C. and Deroy, O. (2012) Crossmodal correspondences: innate or learned? i-Perception 3(5): 316–318. doi: 10.1068/i0526ic.

Stoerig, P. and Proulx, M. J. (2008) Learning to recognize objects with an image-to-sound conversion-based sensory system. Paper presented at the Volkswagen Foundation status symposium on ‘Dynamics and adaptivity of neuronal systems’, Tübingen, Germany.

Striem-Amit, E., Guendelman, M. and Amedi, A. (2012) ‘Visual’ acuity of the congenitally blind using visual-to-auditory sensory substitution. PLoS One 7(3): e33136. doi: 10.1371/journal.pone.0033136.

Thaler, L., Arnott, S. R. and Goodale, M. A. (2011) Neural correlates of natural human echolocation in early and late blind echolocation experts. PLoS One 6(5): e20162. doi: 10.1371/journal.pone.0020162.

Treisman, A. M. and Gelade, G. (1980) A feature-integration theory of attention. Cognitive Psychology 12(1): 97–136. doi: 0010-0285(80)90005-5 [pii].

Von Helmholtz, H. (2005) Treatise on Physiological Optics, vol. 3. Mineola, NY: Dover Publications.

Walker, B. N. and Mauney, L. M. (2001) Psychophysical scaling of sonification mappings: a comparison of visually impaired and sighted listeners. Proceedings of the 2001 International Conference on Auditory Display, Espoo, Finland, 29 July–1 August.

Walker, P. and Smith, S. (1984) Stroop interference based on the synaesthetic qualities of auditory pitch. Perception 13(1): 75–81.