Cover: Computational Models for Cognitive Vision, by Hiranmay Ghosh

IEEE Press
445 Hoes Lane
Piscataway, NJ 08854

IEEE Press Editorial Board
Ekram Hossain, Editor in Chief

Jón Atli Benediktsson, David Alan Grier, Elya B. Joffe,
Xiaoou Li, Peter Lian, Andreas Molisch,
Saeid Nahavandi, Jeffrey Reed, Diomidis Spinellis,
Sarah Spurgeon, Ahmet Murat Tekalp

Computational Models for Cognitive Vision

 

Hiranmay Ghosh

 

Ex-Advisor, TCS Research

 

 

 

 

 


About the Author


Hiranmay Ghosh is a researcher in Computer Vision, Artificial Intelligence, Machine Learning, and Cognitive Computing. He received his Ph.D. degree from the Electrical Engineering Department of IIT Delhi and his B.Tech. degree in Radiophysics and Electronics from the University of Calcutta.

Hiranmay was a research adviser with Tata Consultancy Services. He has been associated with R&D and engineering activities for more than 40 years in industry and in autonomous research laboratories. He has been invited to teach at the Indian Institute of Technology Delhi and the National Institute of Technology Karnataka as Adjunct Faculty. He is also a co-author of the book Multimedia Ontology: Representation & Applications.

He is a Senior Member of IEEE, Life Member of IUPRAI, and a Member of ACM.

Acknowledgments

This book is an outcome of my studies in cognitive vision and presents an overview of the subject. At the outset, I am indebted to all the researchers who have toiled to make the subject grow. While I have cited much of their research, I could not do justice to all of it within the finite bounds of this book. Further, my thanks go to the management of arXiv, ACM, IEEE, ResearchGate, and other digital repositories, without which access to the research papers would not have been possible.

Writing the book provided me with an opportunity to interact with several researchers in the field. My special thanks go to the researchers who engaged in discussions, forwarded their research material, and happily consented to my request to reproduce it in the book. They include Aude Oliva, Calden Wloka, Charles Kemp, David Vernon, Guanbin Li, Guilin Liu, Harald Haarmann, J.K. Aggarwal, Jason Yosinski, John Tsotsos, Marcin Andrychowicz, Nicholas Wade, Pritty Patel-Grosz, Roberto Cipolla, Rudra Poudel, S.M. Reza Soroushmehr, Sangho Park, Stan Franklin, Sumit Chopra, Ulrich Engelke, and V. Badrinarayan. Further, many illustrations in the book have been borrowed from the Wikimedia Commons library; I thank their authors for creating those diagrams and licensing them for general use, and the Wikimedia management for providing a platform for sharing them.

I thank the management of the TCS Research team and my colleagues, especially K. Ananth Krishnan, P. Balamurali, and Hrishikesh Sharma, for providing me with an environment and support for conducting my studies. I thank Prof. V.S. Subrahmanian, Prof. Ramesh Jain, and Prof. Santanu Chaudhury for encouraging me to write the book.

My thanks go to the management and the production team of IEEE Press and John Wiley & Sons for taking up the publication of the book. Special mention goes to Mary Hatcher, Victoria Bradshaw, Louis V. Manoharan, and Gayathree Sekar for providing me with the necessary support at various stages of authoring and production. Finally, I thank my spouse Sharmila for encouraging and supporting me at every stage of my research and the authoring of this book, and for selecting an appropriate cover for it.

–Hiranmay Ghosh

Preface

As I started my research on cognitive vision a few years back, it became apparent that there was no up-to-date textbook on the subject. There has been tremendous research on cognitive vision in recent years, and rich material lies scattered in myriad articles and other scientific publications. This was my prime motivation for compiling the material into a coherent narrative in the form of this book, which may initiate a reader into the various aspects of the subject.

As I proceeded with my research, I realized that cognitive vision is still an immature technology. It is struggling to attain its ambitious goal of achieving the versatility and capabilities of the human vision system. It was also evident that the scope of cognitive vision is ill-defined. There is not just one single way to emulate human vision, and researchers have trodden diverse paths. The gamut of research appears like islands in an ocean, a good part of which is yet to be traversed. This posed a formidable difficulty in organizing the book in a linear and cohesive manner. The sequence that I finally settled on is one of many possible alternatives.

This book does not represent my contribution to the subject, but collates the work of many researchers to create a coherent narrative with wide coverage. It is primarily intended for academic as well as industry researchers who want to explore the arena of cognitive vision and apply it to real-life problems. The goal of the book is to demystify many of the mysteries that the human visual system holds, and to provide computational models of them that can be realized in artificial vision systems. Since cognitive vision is a vast subject, it has not been possible to cover it exhaustively in the finite expanse of this book. To overcome this shortcoming, I have tried to provide as many references as possible for readers to explore the subject further. I have consciously given preference to surveys and reviews that provide many more pointers to the rich research on the individual topics. Further, since cognitive vision is a fast-growing research area, I have tried to cover as much recent research as possible, without compromising on the classical texts on the subject. Nevertheless, these citations are not exhaustive, but provide just a sample of the major research directions.

Use of this book as course material on the subject is also envisaged. However, my suggestion would be to restrict the number of topics and to expand on them. In particular, the realization of cognitive vision through deep learning in emergent architectures, which is briefly reviewed in Chapter 8, can be a subject in itself and be dealt with independently.

I hope that the book will be beneficial to the academic as well as the industrial community for a significant period of time to come.

Hiranmay Ghosh

Acronyms

AB-RNN attention-based recurrent neural network
AGC automatic gain control
AGI artificial general intelligence
AI artificial intelligence
AIM attention-based information maximization
ANN artificial neural network
AUC-ROC area under the curve – receiver operating characteristics
AVC advanced video coding
BN Bayesian network
BNN Bayesian neural network
BRNN bi-directional recurrent neural network
CAM content addressable memory
CIE international commission on illumination
CNN convolutional neural network
CP Cognitive Program
CR consequential region
CRF conditional random field
CSM current situational model
DBN dynamic Bayesian network
DL deep learning
DL description logics
DNN deep neural network
DoG difference of Gaussians
FCNN fully convolutional neural network
FDM fixation density map
FoL first-order logic
FR full reference (visual quality assessment)
GAN generative adversarial network
GNN graph neural network
GPU graphics processing unit
GW global workspace
HBM hierarchical Bayesian model
HBN hierarchical Bayesian network
HDR high dynamic range
HMM hidden Markov model
HRI human–robot interaction
HSE human social environment
HSV hue-saturation-value
HVS human vision system
ICD Indian classical dance
ICH intangible cultural heritage
KLD Kullback–Leibler divergence
LDA latent Dirichlet allocation
LDR low dynamic range
LIDA Learning Intelligent Distribution Agent
LoG Laplacian of Gaussian
LOTH language of thought hypothesis
LSTM long short-term memory
LTM long-term memory
MCMC Markov chain Monte Carlo
MEBN multi-entity Bayesian network
MOWL Multimedia Web Ontology Language
MSE mean square error
MTL multitask learning
NLP natural language processing
NR no reference (visual quality assessment)
NSS natural scene statistics
NTM neural Turing machine
OWL web ontology language
PAM perceptual associative memory
PLCC Pearson linear correlation coefficient
PSNR peak signal-to-noise ratio
RAM recurrent attention model
RBS rule-based systems
RGB red–green–blue
RNN recurrent neural network
RR reduced reference (visual quality assessment)
RTM representational theory of mind
SALICON SALIency in CONtext
SGP symbol grounding problem
SLAM simultaneous localization and mapping
SMC sensori-motor contingencies
SSIM structural similarity index measure
ST selective tuning (attention model)
STAR selective tuning attentive reference
STM short-term memory
SURF speeded up robust features
SWRL semantic web rule language
TCS Tata Consultancy Services
VQA visual query answering
W3C World-Wide Web Consortium
WTA winner-take-all

1
Introduction

The human vision system (HVS) has a remarkable capability of building three-dimensional models of the environment from the visual signals received through the eyes. The goal of computer vision research is to emulate this capability on man-made apparatus, such as computers. The twentieth century saw tremendous growth in the field of computer vision. Starting with signal processing techniques for demarcating objects in the space–time continuum of visual signals, the field has embraced several other disciplines, such as artificial intelligence and machine learning, for interpreting visual content. As research in computer vision matured, it was pushed toward the turn of the century to address several real-life problems. Examples of such challenging applications include visual surveillance, medical image analysis, computational photography, digital heritage, robotic navigation, and so on.

Though computer vision has shown extremely promising results in many applications in restricted domains, its performance lags that of the HVS by a large margin. While the HVS can effortlessly interpret complex scenes, e.g. those shown in Figure 1.1, artificial vision fails to do so. It is “intuitive” for humans to comprehend the semantics of the scenes at multiple levels of abstraction, and to predict the next movements with some degree of certainty. Derivation of such semantics remains a formidable challenge for artificial vision systems. Further, many real-life applications demand analysis of imperfect imagery, for example with poor lighting, blur, occlusions, noise, background clutter, and so forth. While human vision is robust to such imperfections, computer vision systems often fail to perform in such cases. These observations have motivated deeper study of the HVS and the application of the principles involved to computer vision.


Figure 1.1 Hard challenges for computer vision. (a) “The offensive player … is about to shoot the ball at the goal …” (b) A facial expression in Bharatnatyam dance.

Source: File shared by Rick Dikeman through Wikimedia Commons, file name: Football_iu_1996.jpg.

Source: File shared by Suyash Dwivedi through Wikimedia Commons, file name: Bharatnatyam_different_facial_expressions_(9).jpg.

1.1 What Is Cognitive Vision

Though there is broad agreement in the scientific community that cognitive vision pertains to the application of the principles of biological (especially human) vision systems to computer vision applications, the space of cognitive vision studies is not well defined (Vernon 2006). The boundary between vision and cognition is thin, and cognitive vision operates in that gray area. Broadly speaking, cognitive vision involves the abilities to survey a visual scene, recognize and locate objects of interest, act on visual stimuli, learn and generate new knowledge, dynamically update a visual map that represents the reality, and so on. Perception and reasoning are the two important pillars on which cognitive vision stands. A crucial point is that the entire gamut of activities must run in real-time to enable an agent to engage with the real world. It is an emerging area of research integrating methodologies from various disciplines, such as artificial intelligence, computer vision, machine learning, cognitive science, and psychology. There is no single approach to cognitive vision, and the proposed solutions to the different problems appear like islands in an ocean. In this book, we have attempted to put together computational theories for a set of cognitive vision problems and organized them in an attempt to develop a coherent narrative for the subject. We shall get more insight into what cognitive vision is as we proceed through the book, and shall characterize it in clearer terms in Chapter 10.

1.2 Computational Approaches for Cognitive Vision

Two branches of science have significantly contributed to the understanding of the processes for cognition from visual as well as other sensory signals. One of them is psychophysics, which is defined as the “study of quantitative relations between psychological events and physical events or, more specifically, between sensations and the stimuli that produce them” (Encyclopedia Britannica). The subject was established by Gustav Fechner and is a marriage between the study of sensory processes and that of physical stimuli. The other branch of science that has facilitated our understanding of perception and cognition is neurophysiology, which combines physiology and the neural sciences for an understanding of the functions of the nervous system. The two approaches are complementary. While psychophysics answers what happens during cognition, neurophysiology explains how it is realized in the biological nervous system.

Researchers on cognitive vision have long recognized it as an information processing activity of the biological neural system. However, a formal computational approach to understanding cognition has been a fundamental contribution of David Marr (1976). Marr abstracted vision into three separable layers, namely (i) hardware, (ii) representation and algorithms, and (iii) computational theory. This abstraction enables computational theories of cognitive vision to be formulated independently of their implementation in the biological vision system. It also provides a basis for realizing cognitive functions in artificial systems made up of altogether different hardware, possibly using different representations and algorithms. Further, Marr's model of vision assumes modularity and a pipelined architecture, two important properties of information processing systems that allow independent formulation of the different cognitive processes with defined interfaces. Marr identifies three stages of processing for vision. The first involves finding the basic contours that mark the object boundaries. The second stage results in the discovery of the surfaces and their orientations, producing an observer-centric 2½D model. The third involves knowledge-based interpretation of the model into an observer-neutral set of objects that constitute the 3D environment. These three stages roughly correspond to the early vision, perception, and cognition stages of vision, as recognized in the modern literature, which we shall describe shortly.

As suggested by David Marr, it is possible to study computational theories of cognitive vision in isolation from the biological systems, and we propose to do exactly that in this book. However, such computational models need to explain the what part of cognition. For that purpose, we shall refer to the results of psychophysical experiments, wherever relevant, without going into the details of the experimental setups. Further, though the goal of computational modeling is to support alternative (artificial) implementations of cognition that need not be based on biological implementation models, analysis of the latter often provides clues to plausible implementation schemes. We shall discuss the results of some relevant neurophysiological studies in the book. We shall consciously keep such discussions at a superficial level, so that the text can be followed without a deep knowledge of either psychology or the neurosciences.

1.3 A Brief Review of Human Vision System

In this section, we briefly look into how human vision works, in order to put the rest of the text in this book in context. A broad overview of the HVS is presented in Figure 1.2. It comprises a pair of eyes connected to the brain via the optic nerves. When one looks at a scene, the light rays enter the eyes to form a pair of inverted images on screens at the back of the eyes, known as the retinas. This corresponds to a mapping of the external 3D world to a pair of 2D images with slightly different perspectives. Internal representations of the images are transmitted by the optic nerve fibers to the visual cortex at the rear end of the brain, where the images are correlated and interpreted to reconstruct a symbolic description of the 3D world.

In this simple model of biological vision, the eyes primarily act as the image capture devices in the system, and the brain as the interpreter. In reality, things are much more complex. The output from the eyes is not a faithful reproduction of the images received. Significant transformations take place on the retina, which enable efficient identification of object contours and their movements. These transformations are collectively referred to as early vision. Further processing in the neural circuits of the brain, which results in interpretation of the signals received from the eyes, is known as late vision. The goal of late vision is to establish the what and where of the objects located in the scene. It is believed that there are two distinct pathways in the human brain, ventral and dorsal, through which visual information is processed to answer these two aspects of vision (Milner and Goodale 1995). This has been emulated in several artificial vision systems, as we shall see in the following chapters of this book.

One of the initial tasks of the late vision system is to correlate the images received from the two eyes, which is facilitated by the criss-cross connection of the optic nerves connecting the eyes with the brain. Further, the late vision system achieves progressive abstraction of the correlated images, leading to perception and cognition, which we discuss in some detail in Section 1.4.
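Correlating the two slightly different retinal images is what makes depth recovery possible. As a toy illustration, not drawn from this book, the following sketch uses an idealized pinhole-camera model: a point at depth Z projects onto the two retinas with a horizontal disparity d = f·b/Z, where f is the focal length and b is the interocular baseline, so measuring d recovers Z = f·b/d. The numerical values are merely indicative.

```python
# Toy stereo geometry: two pinhole "eyes" separated by baseline b view
# a point at depth Z; the difference between its two image positions
# (the disparity) encodes the depth.

def project(X, Z, f, eye_x):
    """Pinhole projection of world point (X, Z) onto the retina of an
    eye located at horizontal position eye_x."""
    return f * (X - eye_x) / Z

def depth_from_disparity(f, b, d):
    """Recover depth from a measured disparity d (d = f * b / Z)."""
    return f * b / d

f, b = 0.017, 0.065   # ~17 mm eye focal length, ~65 mm interocular distance
X, Z = 0.5, 2.0       # a point half a metre to the right, 2 m away

x_left = project(X, Z, f, eye_x=-b / 2)
x_right = project(X, Z, f, eye_x=+b / 2)
d = x_left - x_right                   # disparity found by correlating images
print(depth_from_disparity(f, b, d))   # recovers Z ≈ 2.0
```

The correlation step that the brain performs, matching corresponding patterns in the two images to measure d, is the computationally hard part; the geometry above only shows why the measurement is worth making.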


Figure 1.2 An overview of human vision system.

Source: Derivative work from file shared by Wiley through Wikimedia Commons, file name: Wiley_Human_Visual_System.gif.

1.4 Perception and Cognition

The first step in interpreting retinal images involves organization of the visual data, the isolated patterns on the retina, to create a coherent interpretation of the environment. This stage is known as perception. Though we shall focus on visual perception in this book, biological perception generally results in a coordinated organization of inputs from all the sensory organs. For example, human beings create a coordinated interpretation of visual and haptic signals while grabbing an object. For an artificial agent, for example a driverless car, perception involves all the sensors that it is equipped with. In philosophical terms, perception is about asserting a truth about the environment by processing sensory data. However, the “assertion” by an agent can be different from the reality, e.g. a vehicle seen through the convex side-view mirror of a car may be perceived to be farther than it actually is. Such “erroneous” perceptions often lead to illusions, some of which we shall discuss in Chapters 2 and 4 of this book. Some authors prefer to include a capability to respond to the percepts in the connotation of perception.

Cognition refers to an experiential interpretation of the percepts. It involves reasoning about the properties of the percepts with the background knowledge and experience that an agent possesses. Depending on the knowledge level of the agent, there can be many different levels of interpretation of the percepts. For example, Figure 1.1b can be interpreted in many ways with progressive levels of abstraction, such as a human face, a classical dance form, or an emotion expressed. Cognition may also result in “correcting” erroneous perceptions using specific domain knowledge. For example, knowledge of the properties of a convex mirror results in a more realistic estimate of the distance of an object seen through the side-view mirror of a car. Cognition involves the intentional state of an agent as well. For example, while driving an automobile, a driver analyzes the visual (and audio) percepts with the objective of reaching the destination while ensuring safety and complying with the traffic rules. In the process, the driver may focus on the road in front and the traffic lights, ignoring other percepts, such as the signage on the shop-fronts bordering the street. Such selective filtering of sensory data is known as attention. It is necessary to prevent the cognitive agent from being swamped by a huge volume of information that it cannot process.

Thus, we find that cognition involves not only the interpretation of sensory signals but also many other factors, such as the intention, knowledge, attention, and memory of an agent. Moreover, the knowledge of a cognitive agent needs to be continuously updated for it to adapt to new environments and to respond to yet unforeseen situations. For example, while driving on a hilly road, a city driver needs to quickly learn the specific skills of hill driving to ensure a safe journey. The process through which the knowledge is updated is called learning, and it is a critical requirement for a real-life agent.


Figure 1.3 A simple process model in a cognitive system.

The fundamental difference between perception and cognition is that the former results in the acquisition of new information through the sensory organs, while the latter is the process of experiential analysis of the acquired information with some intention. There is, however, a strong interaction between the two processes. Percepts, filtered through the attention mechanism, enter the cognitive process. On the other hand, cognitive interpretation of the percepts results in signals to control further data acquisition and perception. This ensures need-based, just-in-time visual data collection based on the intention of a cognitive agent, which is also known as active vision. Moreover, discovery of new semantic patterns through the process of cognition leads to updates in the knowledge store of an agent. A simplified process model of a cognitive system is shown in Figure 1.3.
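The perception–attention–cognition loop described above can be caricatured in a few lines of Python. This is a deliberately naive sketch, not an implementation from the literature: all class names, thresholds, and labels are hypothetical, attention is reduced to a salience threshold biased by the agent's goal, and learning and active-vision feedback are reduced to single assignments.

```python
# A minimal, hypothetical sketch of the process model of Figure 1.3:
# percepts are filtered by attention, interpreted against stored
# knowledge (cognition), novel patterns update the knowledge store
# (learning), and the interpretation feeds back to steer further data
# acquisition (active vision).
from dataclasses import dataclass, field

@dataclass
class Percept:
    label: str        # what was sensed
    salience: float   # bottom-up conspicuity, in [0, 1]

@dataclass
class CognitiveAgent:
    goal: str                                      # intentional state
    knowledge: dict = field(default_factory=dict)  # label -> interpretation
    focus: float = 0.5                             # attention threshold

    def attend(self, percepts):
        # Attention: keep only salient or goal-relevant percepts
        return [p for p in percepts
                if p.salience >= self.focus or p.label == self.goal]

    def interpret(self, percept):
        # Cognition: experiential interpretation using background knowledge
        return self.knowledge.get(percept.label, "unknown")

    def step(self, percepts):
        results = [(p.label, self.interpret(p)) for p in self.attend(percepts)]
        for label, meaning in results:
            if meaning == "unknown":
                self.knowledge[label] = "learned-pattern"     # learning
                self.focus = max(0.1, self.focus - 0.1)       # widen attention
        return results

agent = CognitiveAgent(goal="traffic-light",
                       knowledge={"traffic-light": "stop-or-go signal"})
scene = [Percept("traffic-light", 0.3), Percept("shop-sign", 0.2),
         Percept("pedestrian", 0.8)]
print(agent.step(scene))
# [('traffic-light', 'stop-or-go signal'), ('pedestrian', 'unknown')]
```

Note how the low-salience shop sign is filtered out by attention, the goal-relevant traffic light passes despite its low salience, and the unrecognized pedestrian both enters the knowledge store and lowers the attention threshold for subsequent data acquisition.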

1.5 Organization of the Book

The characterization of cognitive vision and its various stages presented above, sets the background of the rest of this book. We begin with early vision system in Chapter 2, where we describe the transformations that an image goes through by the actions of the neural cells on the retina. In Chapter 3, we introduce Bayesian reasoning framework, which will be used to explain many of the perceptual and cognitive processes in the later chapters. We explain several perceptual and cognitive processes in Chapter 4. Chapter 5 deals with visual attention, the gateway between the world of perception and the world of cognition.

While the earlier chapters describe the individual processes of perception and cognition, they need to be integrated in an architectural framework for realization of cognitive systems. We characterize cognitive architectures, discuss their generic properties, and review a few popular and contemporary architectures as examples, in Chapter 6. While the architectures provide generic cognitive capabilities and interaction with the environment, we focus on the functions for cognitive vision in these architectures. Knowledge is a crucial ingredient of a cognitive system, and we introduce classical approaches to its representation in Chapter 7.

There is a huge corpus of recent research that attempts to emulate the biological vision system with artificial neural networks and aims to learn the cognitive processes with deep learning techniques. A discourse on cognitive vision cannot be complete without them. We present a cross-section of this research in Chapter 8. In this chapter, we elaborate on the various modes of learning capability that a real-life agent needs to possess, and that have been realized with deep learning techniques.

We discuss a few real-life applications for visual cognition in Chapter 9 and illustrate the use of the principles of cognitive vision. In Chapter 10, we take a look through a rear-view mirror to review what we studied, which enables us to characterize cognitive vision in more concrete terms. Further, we compare the two complementary paradigms of cognition, namely classicist and connectionist approaches, and discuss a possible synergy between the two that may be on the anvil.

Finally, a few words about the content of the book. The computational theory of cognitive vision is a vast subject, and it is not possible to cover all of it within the extent of one book. I have tried to condense as much information as possible into this book, without sacrificing understandability, and have provided ample references for interested readers to explore the subject further. While providing the citations, I have given preference to authentic reviews and tutorials that should enable a reader to get an overview of the subject, and which may lead an inquisitive reader to many other relevant research publications. Also, cognitive vision being a rapidly evolving subject, I have tried to cover as much recent material as possible, without ignoring the classic texts on the subject. Though I focus on cognitive vision, many of the principles of perception and cognition discussed in the book are not exclusive to the visual system alone, but hold good for other sensory mechanisms as well.