Manuscript #10844

eLife Assessment

This study provides important insights into how researchers can use perceptual metamers to formally explore the limits of visual representations at different processing stages. The framework is compelling and the data largely support the claims, subject to minor caveats.

Reviewer #1 (Public review):

This is an interesting study on the nature of representations across the visual field. The question of how peripheral vision differs from foveal vision is a fascinating and important one. The majority of our visual field is extra-foveal yet our sensory and perceptual capabilities decline in pronounced and well-documented ways away from the fovea. Part of the decline is thought to be due to spatial averaging ('pooling') of features. Here, the authors contrast two models of such feature pooling with human judgments of image content. They use much larger visual stimuli than in most previous studies, and some sophisticated image synthesis methods to tease apart the prediction of the distinct models.

More importantly, in so doing, the researchers thoroughly explore the general approach of probing visual representations through metamers: stimuli that are physically distinct but perceptually indistinguishable. The work is embedded within a rigorous and general mathematical framework for expressing equivalence classes of images and how visual representations influence these. They describe how image-computable models can be used to make predictions about metamers, which can then be compared to make inferences about the underlying sensory representations. The main merit of the work lies in providing a formal framework for reasoning about metamers and their implications, for comparing models of sensory processing in terms of the metamers that they predict, and for mapping such models onto physiology. Importantly, they also consider the limits of what can be inferred about sensory processing from metamers derived from different models.

Overall, the work is of a very high standard and represents a significant advance over our current understanding of perceptual representations of image structure at different locations across the visual field. The authors do a good job of capturing the limits of their approach, and I particularly appreciated the detailed and thoughtful Discussion section and the suggestion to extend the metamer-based approach described in the MS with observer models. The work will have an impact on researchers studying many different aspects of visual function, including texture perception, crowding, natural image statistics, and the physiology of low- and mid-level vision.

The main weaknesses of the original submission relate to the writing. A clearer motivation could have been provided for the specific models that they consider, and the text could have been written in a more didactic and easy-to-follow manner. The authors could also have been more explicit about the assumptions that they make.

Comments following re-submission:

Overall, I think the authors have done a satisfactory job of addressing most of the points I raised.

There's one final issue which I think still needs better discussion.

I think reviewer 2 articulated better than I have the point I was concerned about: the relationship between JNDs and metamers as depicted in the schematics and indeed in the whole conceptualization.

I think the issue here is that there seems to be a conflating of two concepts, 'subthreshold' and 'metamer', and I'm not convinced it is entirely unproblematic. It's true that two stimuli that cannot be discriminated from one another, because the physical differences are too small for the visual system to detect reliably, are a form of metamer under the strict definition 'physically different, but perceptually the same'.
However, I don't think this is the scientifically substantial notion of metamer that enabled insights into trichromacy. That form of metamerism is due to the principle of univariance in feature encoding, and involves conditions in which physically very different stimuli are mapped to one and the same point in sensory encoding space whether or not there is any noise in the system. When I say 'physically very different' I mean different by a large enough amount that they would be far above threshold, potentially orders of magnitude larger than a JND if the system's noise properties were identical but the system used a different sensory basis set to measure them. This seems to be a very different kind of 'physically different, but perceptually the same'.
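For concreteness, the trichromatic case can be written out explicitly (the notation here is illustrative: $S_1, S_2$ are stimulus spectra and $C_i$ the three cone spectral sensitivities). Two physically different lights are metamers when

```latex
\int S_1(\lambda)\, C_i(\lambda)\, d\lambda
  \;=\; \int S_2(\lambda)\, C_i(\lambda)\, d\lambda ,
  \qquad i \in \{L, M, S\},
```

an equality that holds exactly, with or without noise, because the difference $S_1 - S_2$ lies in the null space of the three linear cone measurements. That structural collapse is what distinguishes this notion of metamerism from subthreshold physical differences.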

I do think the notion of metamerism can obviously be very usefully extended beyond photoreceptors and photon absorptions. In the interesting case of texture metamers, what I think is meant is that stimuli would be discriminable if scrutinised in the fovea, but because they have the same statistics they are treated as equivalent. I think the discussion of this could still be clearly articulated in the manuscript. It would benefit from a more thorough discussion of the difference between metamerism and subthreshold, especially in the context of the Voronoi diagrams at the beginning.

It needs to be made clear to the reader why it is that two stimuli that are physically similar (e.g., just spanning one of the edges in the diagram) can be discriminable, while at the same time, two stimuli that are very different (e.g., at opposite ends of a cell) can't.

Do the cells include BOTH those sets of stimuli that cannot be discriminated just because of internal noise AND those that can't be discriminated because they are projected to literally the same point in the sensory encoding space? What are the strengths and limits of models that involve the strict binarization of sensory representations, and how can they be integrated with models dealing with continuous differences? These seem like important background concepts that ought to be included in either the introduction or discussion sections. In this context it might also be helpful to refer to the notion of 'visual equivalence' as described by:

Ramanarayanan, G., Ferwerda, J., Walter, B., & Bala, K. (2007). Visual equivalence: towards a new standard for image fidelity. ACM Transactions on Graphics (TOG), 26(3), 76-es.

Other than that, I congratulate the authors on a very interesting study, and look forward to reading the final version.

Reviewer #2 (Public review):

Summary:

The authors have improved clarity overall and have spoken to most of the issues raised by the reviewers. There are still two outstanding problems, however, where issues raised during the review were inappropriately dismissed in the manuscript. These should be explicitly addressed as limitations of the results presented (no eye tracking) and as early pilot experiments that informed the experiments as presented (pink noise), rather than brushed off as 'unnecessary' or as something that 'would be uninformative'.

Eye tracking:


It is generally accepted that experiments testing stimuli presented at specific locations in peripheral vision require eye tracking to ensure that the stimulus is presented as expected, in particular in the correct location. As I stated in the previous round of review, while a stimulus presentation time of 200ms does help eliminate some saccades, it does not eliminate the possibility that subjects were not fixating well during stimulus onset. I am also unclear what the authors mean by 'trained observer' in this context, given that the authors state that an author subject in a different portion of the paper is an 'expert observer'. Does this mean the 'trained observers' are non-expert recruited subjects? Given that the conditions tested differ from previous work (Freeman & Simoncelli, 2011), which DID include eye tracking in a subset of subjects (*these differences are a main contribution of the paper!*), it is entirely possible to get results similar to this work without eye-tracking-controlled stimulus presentation. The reasons now given in the manuscript are not grounds for considering eye tracking 'unnecessary'.

I appreciate that the authors now state the lack of eye tracking explicitly, but believe the paper needs to at least state that this is a limitation of the results reported; deeming eye tracking 'considered unnecessary' is unreasonable and not the norm in this subfield.

N=1: The authors now state clearly the limitations of a single subject in the manuscript, and state the expertise level of this subject.

Large number of trials: The authors now address this and include an enumeration of the large number of trials.

Simple Models / Physiology comparison: I support the choice to reduce claims regarding tight connections to physiology, and appreciate the explanation of the luminance model.

Previous Work: I appreciate the author's changes to the introduction, both in discussing previous work and citation fixes.

Blurred White, Pink Noise: While the authors now address pink noise, the explanation for such stimuli being expected to be uninformative is confusing to me. The manuscript now first states that pink noise is a natural choice, then claims it would be uninformative, while also stating in the rebuttal (not the manuscript) that they tried it and that it indeed reduced the artifacts they note. The logic of the experiments indeed relies on finding the smallest critical scaling value, which is measured by subjects judging whether a synthesis is similar to or different from a target or a second synthesis. A synthesis free from artifacts would surely affect the subjects' responses and the smallest critical scaling measured.

The statement that the authors experimented with pink noise early on and found that it addressed the artifacts should appear in the manuscript itself, not just in the rebuttal, and the blanket statement that this experiment would be 'uninformative' is incorrect. Surely this early pilot, which the authors mention in the rebuttal, was informative in designing the experiments that appear in the final paper, and it would be an informative experiment to include.

Author response:

The following is the authors’ response to the original reviews.

Reviewer #1 (Public Review):

 

This is an interesting study of the nature of representations across the visual field. The question of how peripheral vision differs from foveal vision is a fascinating and important one. The majority of our visual field is extra-foveal yet our sensory and perceptual capabilities decline in pronounced and well-documented ways away from the fovea. Part of the decline is thought to be due to spatial averaging ('pooling') of features. Here, the authors contrast two models of such feature pooling with human judgments of image content. They use much larger visual stimuli than in most previous studies, and some sophisticated image synthesis methods to tease apart the prediction of the distinct models.

 

More importantly, in so doing, the researchers thoroughly explore the general approach of probing visual representations through metamers: stimuli that are physically distinct but perceptually indistinguishable. The work is embedded within a rigorous and general mathematical framework for expressing equivalence classes of images and how visual representations influence these. They describe how image-computable models can be used to make predictions about metamers, which can then be compared to make inferences about the underlying sensory representations. The main merit of the work lies in providing a formal framework for reasoning about metamers and their implications, for comparing models of sensory processing in terms of the metamers that they predict, and for mapping such models onto physiology. Importantly, they also consider the limits of what can be inferred about sensory processing from metamers derived from different models.

 

Overall, the work is of a very high standard and represents a significant advance over our current understanding of perceptual representations of image structure at different locations across the visual field. The authors do a good job of capturing the limits of their approach and I particularly appreciated the detailed and thoughtful Discussion section and the suggestion to extend the metamer-based approach described in the MS with observer models. The work will have an impact on researchers studying many different aspects of visual function including texture perception, crowding, natural image statistics, and the physiology of low- and mid-level vision.

 

The main weaknesses of the original submission relate to the writing. A clearer motivation could have been provided for the specific models that they consider, and the text could have been written in a more didactic and easy-to-follow manner. The authors could also have been more explicit about the assumptions that they make.

 

Thank you for the summary. We appreciate the positives noted above. We address the weaknesses point by point below.

 

Reviewer #2 (Public Review):

 

Summary

 

This paper expands on the literature on spatial metamers, evaluating different aspects of spatial metamers, including the effect of different models and initialization conditions, as well as the relationship between metamers of the human visual system and metamers for a model. The authors conduct psychophysics experiments testing variations of metamer synthesis parameters, including type of target image, scaling factor, and initialization parameters, and also compare two different metamer models (luminance vs energy). An additional contribution is doing this for a field of view larger than has been explored previously.

 

General Comments

 

Overall, this paper addresses some important outstanding questions regarding comparing original to synthesized images in metamer experiments and begins to explore the effect of noise vs image seed on the resulting syntheses. While the paper tests some model classes that could be better motivated, and the results are not particularly groundbreaking, the contributions are convincing and undoubtedly important to the field. The paper includes an interesting Voronoi-like schematic of how to think about perceptual metamers, which I found helpful, but for which I do have some questions and suggestions. I also have some major concerns regarding incomplete psychophysical methodology including lack of eye-tracking, results inferred from a single subject, and a huge number of trials. I have only minor typographical criticisms and suggestions to improve clarity. The authors also use very good data reproducibility practices.

 

Thank you for the summary. We appreciate the positives noted above. We address the weaknesses point by point below.

 

Specific Comments

 

Experimental Setup

 

Firstly, the experiments do not appear to utilize an eye tracker to monitor fixation. Without eye tracking or another manipulation to ensure fixation, we cannot ensure the subjects were fixating the center of the image, and viewing the metamer as intended. While the short stimulus time (200ms) can help minimize eye movements, this does not guarantee that subjects began the trial with correct fixation, especially in such a long experiment. While Covid-19 did at one point limit in-person eye-tracked experiments, the paper reports no such restrictions that would have made the addition of eye-tracking impossible. While such a large-scale experiment may be difficult to repeat with the addition of eye tracking, the paper would be greatly improved with, at a minimum, an explanation as to why eye tracking was not included.

 

Addressed on pg. 25, starting on line 658.

 

Secondly, many of the comparisons later in the paper (Figures 9, 10) are made from a single subject. N=1 is not typically accepted as sufficient to draw conclusions in such a psychophysics experiment. Again, if there were restrictions limiting this, it should be discussed. Also (P11): is subject sub-00 an author? Another expert? A naive subject? The subject's expertise in viewing metamers will likely affect their performance.

 

Addressed on pg. 14, starting on line 308.

 

Finally, the number of trials per subject is quite large. 13,000 over 9 sessions is much larger than most human experiments in this area. The reason for this should be justified.

 

In general, we needed a large number of trials to fit full psychometric functions for stimuli derived from both models, with both types of comparison, both initializations, and over many target images. We could have eliminated some of these, but we feel that having a consistent dataset across all these conditions is a strength of the paper.

 

In addition to the sentence on pg. 14, line 318, a full enumeration of trials is now described on pg. 23, starting on line 580.
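To illustrate why the trial counts grow so quickly, here is a minimal sketch of fitting one condition's psychometric function by maximum likelihood. It assumes the two-parameter d-prime form used in this literature, $d'^2(s) = \alpha \max(0, 1 - s_c^2/s^2)$, with a simple 2AFC conversion to proportion correct; all names and numbers are illustrative, and this is not the fitting procedure used in the paper.

```python
# Sketch of fitting a critical-scaling psychometric function by maximum
# likelihood. The d-prime form and 2AFC conversion are simplifying
# assumptions; numbers are illustrative, not the paper's.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def prop_correct(s, s_c, alpha):
    """Proportion correct as a function of scaling s."""
    dprime = np.sqrt(alpha * np.clip(1 - (s_c / s) ** 2, 0, None))
    return norm.cdf(dprime / np.sqrt(2))  # chance (0.5) below s_c

def neg_log_likelihood(params, s, n_correct, n_trials):
    s_c, alpha = params
    p = np.clip(prop_correct(s, s_c, alpha), 1e-6, 1 - 1e-6)
    return -np.sum(n_correct * np.log(p)
                   + (n_trials - n_correct) * np.log(1 - p))

# Simulated data: 8 scaling values x 90 trials = 720 trials for ONE
# model / comparison / initialization cell; crossing all conditions and
# target images is what drives the total into the thousands.
rng = np.random.default_rng(0)
scalings = np.geomspace(0.06, 0.5, 8)
n_trials = np.full(8, 90)
n_correct = rng.binomial(n_trials, prop_correct(scalings, 0.12, 6.0))

fit = minimize(neg_log_likelihood, x0=[0.1, 4.0],
               args=(scalings, n_correct, n_trials),
               bounds=[(0.01, 1.0), (0.1, 50.0)])
print("estimated critical scaling s_c:", fit.x[0])
```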

 

Model

 

For the main experiment, the authors compare the results of two models: a 'luminance model' that spatially pools mean luminance values, and an 'energy model' that spatially pools energy calculated from a multi-scale pyramid decomposition. They show that these models create metamers that result in different thresholds for human performance, and therefore different critical scaling parameters, with the basic luminance pooling model producing a scaling factor 1/4 that of the energy model. While this is certain to be true, due to the luminance model being so much simpler, the motivation for the simple luminance-based model as a comparison is unclear.

 

The use of simple models is now addressed on pg. 3, starting on line 98, as well as the sentence starting on pg. 4 line 148: the luminance model is intended as the simplest possible pooling model.
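To make "simplest possible pooling model" concrete, the sketch below shows the shared structure of the two models: a feature map is averaged within windows whose size grows in proportion to eccentricity, controlled by the scaling parameter. The ring-shaped windows, the single difference-of-Gaussians band standing in for the steerable pyramid, and all names are simplifying assumptions for illustration, not the implementation used in the paper (which uses smooth, overlapping log-polar windows and a full multi-scale decomposition).

```python
# Minimal sketch of the two pooling models' shared structure, assuming
# ring-shaped pooling windows whose radial width grows as
# scaling * eccentricity. Illustrative only.
import numpy as np
from scipy.ndimage import gaussian_filter

def eccentricity_map(shape, fovea=None):
    """Distance of each pixel from the fixation point, in pixels."""
    h, w = shape
    fy, fx = fovea if fovea is not None else (h / 2, w / 2)
    ys, xs = np.mgrid[:h, :w]
    return np.hypot(ys - fy, xs - fx)

def pool_feature(feature, ecc, scaling):
    """Average a feature map within eccentricity rings whose width is
    proportional to eccentricity (log-spaced edges, ratio 1 + scaling);
    the angular tiling of the real windows is omitted for brevity."""
    e_min, e_max = max(ecc.min(), 1.0), ecc.max()
    edges = [e_min]
    while edges[-1] < e_max:
        edges.append(edges[-1] * (1 + scaling))
    pooled = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (ecc >= lo) & (ecc < hi)
        pooled.append(feature[mask].mean() if mask.any() else 0.0)
    return np.array(pooled)

def luminance_model(image, scaling):
    """Pool raw pixel luminance: the simplest possible pooling model."""
    return pool_feature(image, eccentricity_map(image.shape), scaling)

def energy_model(image, scaling, sigma_fine=1.0, sigma_coarse=4.0):
    """Pool local spectral energy: one squared bandpass channel here,
    standing in for a multi-scale, multi-orientation pyramid."""
    band = gaussian_filter(image, sigma_fine) - gaussian_filter(image, sigma_coarse)
    return pool_feature(band ** 2, eccentricity_map(image.shape), scaling)

image = np.random.default_rng(0).random((256, 256))
print(luminance_model(image, 0.25).shape, energy_model(image, 0.25).shape)
```

Smaller scaling means narrower rings and hence more pooled measurements, which is why the representation constrains the image more tightly as scaling shrinks.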

 

The authors claim that this luminance model captures the response of retinal ganglion cells, often modeled as a center-surround operation (Rodieck, 1964). I am unclear in what aspect(s) the authors claim these center-surround neurons mimic a simple mean luminance, especially in the context of evidence supporting a much more complex role of RGCs in vision (Atick & Redlich, 1992). Why do the authors not compare the energy model to a model that captures center-surround responses instead? Do the authors mean to claim that the luminance model captures only the pooling aspects of an RGC model? This is particularly confusing as Figures 6 and 9 show the luminance and energy models for original vs synth aligning with the scaling of Midget and Parasol RGCs, respectively. These claims should be more clearly stated, and citations included to motivate this. Similarly, with the energy model, the physiological evidence is very loosely connected to the model discussed.

 

We have removed the bars showing potential scaling values measured by electrophysiology in the primate visual system and attempted to clarify our language around the relationship between these models and physiology. Our metamer models are only loosely connected to the physiology, and we’ve decided in revision not to imply any direct connection between the model parameters and physiological measurements. The models should instead be understood as loosely inspired by physiology, but not as a tool to localize the representation (as was done in the Freeman paper).

 

The physiological scaling values are still used as the mean of the priors on the critical scaling value for model fitting, as described on pg. 27, starting on line 698.

 Prior Work:

 

While the explorations in this paper clearly have value, it does not present any particularly groundbreaking results, and those reported are consistent with previous literature. The explorations around critical eccentricity measurement have been done for texture models (Figure 11) in multiple papers (Freeman 2011; Wallis 2019; Balas 2009). In particular, Freeman 2011 demonstrated that simpler models, representing measurements presumed to occur earlier in visual processing, need smaller pooling regions to achieve metamerism. This work's measurements for the simpler models tested here are consistent with those results, though the model details are different. In addition, Brown 2023 (which is miscited) also used an extended field of view (though not as large as in this work). Both Brown 2023 and Wallis 2019 performed an exploration of the effect of the target image. Also, much of the more recent previous work uses color images, while the authors' exploration is done only for greyscale.

 

We were pleased to find consistency of our results with previous studies, given the (many) differences in stimuli and experimental conditions (especially viewing angle), while also extending them with new results on the luminance model and the effects of initialization. Note that only one of the previous studies (Freeman and Simoncelli, 2011) used a pooled spectral energy model. Moreover, of the previous studies, only one (Brown et al., 2023) used color images (we have corrected that citation - thanks for catching the error).

 

Discussion of Prior Work:

 

The prior work on testing metamerism between original vs. synthesized and synthesized vs. synthesized images is presented in a misleading way. Wallis et al.’s prior work on this should not be a minor remark in the post-experiment discussion. Rather, it was surely a motivation for the experiment. The text should make this clear; a discussion of Wallis et al. should appear at the start of that section. The authors similarly cite much of the most relevant literature in this area as a minor remark at the end of the introduction (P3L72).

 

The large differences we observed between comparison types (original vs synthesized, compared to synthesized vs synthesized) surprised us. Understanding this difference was not a primary motivation for the work, but it is certainly an important component of our results. In the introduction, we thought it best to lay out the basic logic of the metamer paradigm for foveated vision before mentioning the complications that are introduced in both the Wallis and Brown papers (paragraph beginning p. 3, line 109). Our results confirm and bolster the results of both of those earlier works, which are now discussed more fully in the Introduction (lines 109 and following).

 

White Noise: The authors make an analogy to the inability of humans to distinguish samples of white noise. It is unclear, however, that human difficulty distinguishing samples of white noise is a perceptual issue; it could instead be due to cognitive/memory limitations. If one concentrates on an individual patch, one can usually tell apart two samples. Support for these difficulties emerging from perceptual limitations should be provided, the possibility that these limitations are more cognitive should be discussed, or a different analogy should be employed.

 

We now note the possibility of cognitive limits on pg. 8, starting on line 243, as well as pg. 22, line 571. The ability of observers to distinguish samples of white noise is highly dependent on display conditions. A small patch of noise (i.e., large pixels, not too many) can be distinguished, but a larger patch cannot, especially when presented in the periphery. This is more generally true for textures (as shown in Ziemba and Simoncelli (2021)). Samples of white noise at the resolution used in our study are indistinguishable.

 

Relatedly, in Figure 14, the authors do not explain why the white noise seeds would be more likely to produce syntheses that end up in different human equivalence classes.

 

In figure 14, we claim that white noise seeds are more likely to end up in the same human equivalence classes than natural image seeds. The explanation as to why we think this may be the case is now addressed on pg. 19, starting on line 423.

 

It would be nice to see the effect of pink noise seeds, which mirror the power spectrum of natural images, but do not contain the same structure as natural images - this may address the artifacts noted in Figure 9b.

 

The lack of pink noise seeds is now addressed on pg. 19, starting on line 429.

 

Finally, the authors note high-frequency artifacts in Figure 4 & P5L135, that remain after syntheses from the luminance model. They hypothesize that this is due to a lack of constraints on frequencies above that defined by the pooling region size. Could these be addressed with a white noise image seed that is pre-blurred with a low pass filter removing the frequencies above the spatial frequency constrained at the given eccentricity?

 

The explanation for this is similar to the lack of pink noise seeds in the previous point: the goal of metamer synthesis is model testing, and so for a given model, we want to find model metamers that result in the smallest possible critical scaling value. Taking white noise seed images and blurring them will almost certainly remove the high frequencies visible in luminance metamers in figure 4 and thus result in a larger critical scaling value, as the reviewer points out. However, the logic of the experiments requires finding the smallest critical scaling value, and so these model metamers would be uninformative. In an early stage of the project, we did indeed synthesize model metamers using pink noise seeds, and observed that the high frequency artifacts were less prominent.
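For reference, both alternative seed types discussed here are straightforward to construct; the sketch below generates a pink noise seed (1/f amplitude spectrum with random phases) and a low-passed white noise seed. The exponent and cutoff values are illustrative choices, not the values used in the pilot.

```python
# Minimal sketch of the two alternative seed images discussed above:
# pink noise (1/f amplitude spectrum, matching natural images) and
# white noise low-passed to remove high frequencies. Exponent and
# cutoff are illustrative assumptions.
import numpy as np

def pink_noise(size, exponent=1.0, seed=0):
    rng = np.random.default_rng(seed)
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    f = np.hypot(fy, fx)
    f[0, 0] = 1.0                      # avoid division by zero at DC
    amplitude = 1.0 / f ** exponent
    phase = rng.uniform(0, 2 * np.pi, (size, size))
    img = np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))
    img -= img.min()
    return img / img.max()             # rescale to [0, 1]

def lowpassed_white_noise(size, cutoff=0.1, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.uniform(size=(size, size))
    f = np.hypot(np.fft.fftfreq(size)[:, None],
                 np.fft.fftfreq(size)[None, :])
    img = np.real(np.fft.ifft2(np.fft.fft2(noise) * (f <= cutoff)))
    img -= img.min()
    return img / img.max()

seed_img = pink_noise(256)
```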

 

Schematic of metamerism: Figures 1, 2, 12, and 13 show a visual schematic of the state space of images, and their relationship to both model and human metamers. This is depicted as a Voronoi diagram, with individual images near the center of each shape, and other images that fall at different locations within the same cell producing the same human visual system response. I felt this conceptualization was helpful. However, implicitly it seems to make a distinction between metamerism and JND (just noticeable difference). I felt this would be better made explicit. In the case of JND, neighboring points, despite having different visual system responses, might not be distinguishable to a human observer.

 

Thanks for noting this – in general, metamers are subthreshold, and for the purpose of the diagram, we had to discretize the space showing metameric regions (Voronoi regions) around a set of stimuli. We’ve rewritten the captions to explain this better. We address the binary subthreshold nature of the metamer paradigm in the discussion section (pg. 19, line 438).

 

In these diagrams and throughout the paper, the phrase 'visual stimulus' rather than 'image' would improve clarity, because the location of the stimulus in relation to the fovea matters whereas the image can be interpreted as the pixels displayed on the computer.

 

We agree and have tried to make this change, describing this choice on pg. 3 line 73.

 

Other

 

The authors show good reproducibility practices with links to relevant code, datasets, and figures.

 

Reviewer #1 (Recommendations For The Authors):

 

In its current form, I found the introduction to be too cursory. I felt that the article would benefit from a clearer motivation for the two models that are considered, as the reader is left unclear why these particular models are of special scientific significance. The luminance model is intended to capture some aspects of retinal ganglion cell response characteristics, and the spectral energy model is intended to capture some aspects of the primary visual cortex. However, one can easily imagine models that include the pooling of other kinds of features, and it would be helpful to get an idea of why these are not considered. Which aspects of processing in the retina and V1 are being considered and which are being left out, and why? Why not consider representations that capture even higher-order statistical structure than those covered by the spectral energy model (or even semantics)? I think a bit of rewriting with this in mind could improve the introduction.

 

Along similar lines, I would have appreciated having the logic of the study explained more explicitly and didactically: which overarching research question is being asked, how it is operationalised in the models and experiments, and what are the predictions of the different models. Figures 2 and 3 are certainly helpful, but I felt further explanations would have made it easier for the reader to follow. Throughout, the writing could be improved by a careful re-reading with a view to making it easier to understand. For example, where results are presented, a sentence or two expanding on the implications would be helpful.

 

I think the authors could also be more explicit about the assumptions they make. While these are obviously (tacitly) included in the description of the models themselves, it would be helpful to state them more openly. To give one example, when introducing the notion of critical scaling, on p.6 the authors state as if it is a self-evident fact that "metamers can be achieved with windows whose size is matched to that of the underlying visual neurons". This presumably is true only under particular conditions, or when specific assumptions about readout from populations of neurons are invoked. It would be good to identify and state such assumptions more directly (this is partly covered in the Discussion section 'The linking proposition underlying the metamer paradigm', but this should be anticipated or moved earlier in the text).

 

We agree that our introduction was too cursory and have reworked it. We have also backed off of the direct comparison to physiology and clarified that we chose these two as the simplest possible pooling models. We have also added sentences at the end of each result section attempting to summarize the implication (before discussing them fully in the discussion). Hopefully the logic and assumptions are now clearer.

 

There are also some findings that warrant a more extensive discussion. For example, what is the broader implication of the finding that original vs. synthesised and synthesised vs. synthesised comparisons exhibit very different scaling values? Does this tell us something about internal visual representations, or is it simply capturing something about the stimuli?

 

We believe this difference is a result of the stimuli that are used in the experiment and thus the synthesis procedure itself, which interacts with the model’s pooled image feature. We have attempted to update the relevant figures and discussions to clarify this, in the sections starting on pg 17 line 396 and pg. 19 line 417.
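To illustrate how the synthesis procedure interacts with the pooled features, consider the simplest (luminance) model, for which matching the pooled representation has a closed-form solution: everything the pooling ignores (all within-window structure) is inherited from the seed. The sketch below uses illustrative ring-shaped windows and names; it is not the paper's implementation, which minimizes the representation error by gradient descent on more complex models.

```python
# Sketch of luminance-model "synthesis": shift each pooling ring of the
# seed so its mean matches the target's. Within-ring detail (e.g., the
# seed's high frequencies) is untouched -- the model cannot "see" it.
import numpy as np

def ring_labels(shape, scaling):
    """Assign each pixel a ring index; ring width ~ scaling * eccentricity."""
    h, w = shape
    ys, xs = np.mgrid[:h, :w]
    ecc = np.maximum(np.hypot(ys - h / 2, xs - w / 2), 1.0)
    return np.floor(np.log(ecc) / np.log(1 + scaling)).astype(int)

def ring_means(image, labels):
    sums = np.bincount(labels.ravel(), weights=image.ravel())
    counts = np.bincount(labels.ravel())
    return sums / np.maximum(counts, 1)

def luminance_metamer(target, seed, scaling):
    """Match the target's pooled means; keep the seed's fine structure.
    (Values may leave [0, 1]; a real synthesis would also enforce range.)"""
    labels = ring_labels(target.shape, scaling)
    shift = ring_means(target, labels) - ring_means(seed, labels)
    return seed + shift[labels]

rng = np.random.default_rng(0)
target, seed = rng.random((128, 128)), rng.random((128, 128))
synth = luminance_metamer(target, seed, scaling=0.3)
```

Because the residual structure comes from the seed, two syntheses from different seeds can differ from each other in ways that neither differs from the target, which is one way the comparison type can change measured thresholds.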

 

At some points in the paper, a third model ('texture model') creeps into the discussion, without much explanation. I assume that this refers to models that consider joint (rather than marginal) statistics of wavelet responses, as in the famous Portilla & Simoncelli texture model. However, it would be helpful to the reader if the authors could explain this.

 

Addressed on pg. 3, starting on line 94.

 

Minor corrections.

 

Caption of Figure 3: 'top' and 'bottom' should be 'left' and 'right'

 

Line 177: 'smallest tested scaling values tested'. Remove one instance of 'tested'.

 

Line 212: 'the images-specific psychometric functions' -> 'image-specific'

 

Line 215: 'cloud-like pink noise'. It's not literally pink noise, so I would drop this.

 

Line 236: 'Importantly, these results cannot be predicted from the model, which gives no specific insight as to why some pairs are more discriminable than others'. The authors should specify what we do learn from the model if it fails to provide insight into why some image pairs are more discriminable than others.

 

Figure 9: it might be helpful to include small insets with the ’highway’ and ’tiles’ source images to aid the reader in understanding how the images in 9B were generated.

 

Table 1 placement should be after it is first referred to on line 258.

 

In the Discussion section "Why does critical scaling depend on the comparison being performed", it would be helpful to consider the case where the two model metamers *are* distinguishable from each other even though each is indistinguishable from the target image. I would assume that this is possible (e.g., if the target image is at the midpoint between the two model images in image space and each of the stimuli is just below 1 JND away from the target). Or is this not possible for some reason?

 

Regarding line 236: this specific line has been removed, and the discussion about this issue has all been consolidated in the final section of the discussion, starting on pg. 19 line 438.

 

Regarding the final comment: this is addressed in the paragraph starting on pg. 16 line 386. To expand upon that: the situation laid out by the reviewer is not possible in our conceptualization, in which metamerism is transitive and image discriminability is binary. In order to investigate situations like the one laid out by the reviewer, one needs models whose representations have metric properties, i.e., which allow you to measure and reason about perceptual distance, which we refer to in the paragraph starting on pg. 20 line 460. We also note that this situation has not been observed in this or any other pooling model metamer study that we are aware of. All other minor changes have been addressed.
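Stated compactly (the notation here is ours): in this conceptualization, model metamerism is the equivalence relation

```latex
x \sim_M y \;\iff\; M(x) = M(y),
```

so if both synthesized stimuli match the target representation, $M(a) = M(t)$ and $M(b) = M(t)$ force $M(a) = M(b)$, and the three stimuli are pairwise metameric. The reviewer's scenario instead presupposes a metric observer model with perceptual distance $d$ and a 1-JND criterion, where the triangle inequality only bounds $d(a,b) \le d(a,t) + d(t,b) < 2$ JND, leaving room for $a$ and $b$ to be discriminable even though neither is discriminable from $t$.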

 

Reviewer #2 (Recommendations For The Authors):

 

Original image T should be marked in the Voronoi diagrams.

 

Brown et al. is miscited as 2021; it should be ACM Transactions on Applied Perception, 2023.

Figure 3 caption: models are left and right, not top and bottom.

 

Thanks, all of the above have been addressed.

 

References

 

Brown R, Dutell V, Walter B, Rosenholtz R, Shirley P, McGuire M, Luebke D. Efficient Dataflow Modeling of Peripheral Encoding in the Human Visual System. ACM Transactions on Applied Perception. 2023 Jan; 20(1):1–22. doi: 10.1145/3564605.

 

Freeman J, Simoncelli EP. Metamers of the ventral stream. Nature Neuroscience. 2011 Aug; 14(9):1195–1201. doi: 10.1038/nn.2889.

 

Ziemba CM, Simoncelli EP. Opposing Effects of Selectivity and Invariance in Peripheral Vision. Nature Communications. 2021 Jul; 12(1). doi: 10.1038/s41467-021-24880-5.