Clever Robots: Understanding Artificial Intelligence in the context of our own minds

***Note: the slides that accompanied the Imperial Festival presentation of this theme are attached here.***

With the sudden emergence of real-world artificial intelligence (AI) applications, such as Google DeepMind’s AlphaGo decisively beating the human Go champion Lee Sedol [1], Samsung releasing an AI for breast cancer detection [2], and Microsoft announcing a deep learning application for the blind that interprets the visual world, the future of a world with intelligent machines has become a hotly discussed topic.

Will robots soon be stealing our jobs or threatening humanity’s very existence? Although incredibly impressive, current AI is still simply a pattern recognition tool, capable of single tasks with no true understanding of the world. We therefore take a simple look at how some of these AIs, specifically convolutional neural networks for deep learning, have been inspired by the human brain, and highlight some of their limitations.

The human brain is a dense web of connections: nerve cell bodies with long tails that allow them to connect with, and send communications to, more distant regions of the brain [4]. In very simple terms, human behaviour can be reduced to the coordinated activity of nerve fibers, where specific brain functions, such as motor control, decision making, and the interpretation of language and the visual world, can each be associated with a specific location in the brain. Further, the function of a brain region can be traced back to the type of nerve cells it is made up of and the other regions it is connected to.

Image from the MGH Human Connectome Project: a model of nerve fibers estimated using diffusion MRI [10]

Despite considerable efforts to understand how many specialised regions exist in the human brain, how they are interconnected, how they vary across individuals, and how subtle variations in the structure of regions and connections relate to complex behavioural qualities such as IQ and personality traits, relatively little of the brain is considered to be well understood.

One part of the brain that has been comprehensively studied is the visual system. From detailed studies in non-human primates [5] and translational studies in humans [6], it is known to be formed from a hierarchy of over 32 different brain regions with more than 300 distinct connections between them. Visual scenes map topographically onto every brain region; in other words, different parts of the visual field map onto different parts of each brain region. Each brain region can be further subdivided into separate areas, each with different cell structures and different connection patterns [4].

Within the visual system there are many specialised functional pathways. Taking object recognition as a specific example, there are eight main regions. These process visual information in a hierarchical fashion, starting with the subcortical regions, the retina and the lateral geniculate nucleus (LGN), which perform low-level processing before passing on to the cortical processing areas: V1, V2, V4 and, at the highest level, the inferior temporal cortex (IT).

A figure from Van Essen and Gallant 1994 that shows the rough location (in a macaque cortex) and function of regions involved in the visual system.

When information is received by the retina, cells respond simply to variations in light and colour, and the 3D world is first mapped onto a 2D topographic representation. The LGN acts as a simple high-pass filter and performs contrast normalisation [5][7][8]. It contains three cell types: P, M and K. Of these, P and M are the best understood: P cells are thought to process colour and high spatial / low temporal resolution data, and M cells rapidly changing scenes.

V1, the first region in the cortex to receive visual information, acts as an edge detector. It subdivides into a series of blob-like subregions and accordingly has two key processing streams, known as blob-dominated (BD) and inter-blob-dominated (ID), named after the compartments they originate from. Of these, blob regions are thought to detect texture, brightness and colour, and inter-blob regions are thought to detect shapes. V2 and V4 process successively more complex image forms, with V4 in particular being implicated in form motion and being especially sensitive to non-Cartesian image motifs (such as hyperbolic and polar spatial patterns).

At the highest level, different cells of the inferior temporal cortex (subdivided into anterior (AIT), posterior (PIT) and central (CIT) areas) are known to activate for specific shapes and forms, such as complex geometrical stimuli and natural shapes such as faces or hands. These cells generally activate for a range of associated stimuli, suggesting they are tuned for generic classes of stimuli rather than specific examples of each class.

A common interpretation of neurons in the visual system is that they act as filters, tuned over several distinct dimensions, processing first spatial frequency and edges, then more complex shapes and textures, and finally full objects. In many ways this reflects the goals of any automated image processing system, where, for some time, image detection and classification tasks were performed by hand-crafting particular types of image feature for edge (Laplacian pyramids, Canny filters) or texture (SURF, SIFT, Haar) detection.
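To make the idea of a hand-crafted feature concrete, here is a minimal sketch of a fixed edge filter applied to a toy image. It uses a Sobel kernel, which is an assumption chosen for brevity and is not one of the methods named above; the helper `filter2d` and the toy image are likewise purely illustrative.

```python
# A hand-designed Sobel kernel: it responds to left-to-right changes in
# brightness, i.e. to vertical edges. Nothing here is learnt from data.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]


def filter2d(image, kernel):
    """Slide the kernel over every position where it fits entirely.

    This is cross-correlation (the kernel is not flipped), which is what
    most image libraries and neural networks compute in practice.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


# Toy 5x6 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

response = filter2d(image, SOBEL_X)
for row in response:
    print(row)
```

The response is zero over the flat regions and non-zero only where the window straddles the dark-to-bright boundary, which is all an edge detector of this kind can report.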

However, our brains do this automatically, recognising and interpreting a literal world of variation every second. With this in mind, Yann LeCun pioneered convolutional deep learning [5]: an artificial visual recognition system modelled on the brain. Like the human brain, these systems consist of several layers formed from a hierarchy of artificial neurons. Specifically:

  1. A normalisation layer: this performs whitening and high-pass filtering of the data. Here, principal component analysis (PCA) is performed on the image data. This is a matrix factorisation approach that represents the data in a new basis that optimally describes its variance. PCA acts as a spatial frequency detector: the top eigenvectors (corresponding to the largest eigenvalues) account for most of the variance and correspond to lower spatial frequencies [9].
  2. A linear filtering or CONVOLUTIONAL layer: here the image is processed in a topographical sense through a series of convolutional filters (or matrix operations) that act on only part of each scene. Each different colour channel or dimension of the world is covered by a separate set of neurons.
  3. A non-linear filter layer: this is an important feature that allows complex representations of space to be separated, in a similar way to kernel learning (used in support vector machines, for example). Layers 2 and 3 detect increasingly complex conjunctions of features or motifs from layer 1.
  4. A pooling layer: this aggregates a subset of values from the previous layer for dimensionality reduction, reducing the number of parameters and the amount of computation in the network and hence also helping to control overfitting. By doing this, pooling layers merge semantically similar things.

This four-level process can be cascaded (repeated over and over), creating deeper and deeper networks in which ever more complex combinations of filters can be encoded.
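The four steps, and the way they cascade, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the architecture of any particular network: the normalisation step is simplified to mean and variance scaling in place of PCA whitening, the non-linearity is assumed to be a ReLU, pooling is assumed to be 2x2 max pooling, and the filter is fixed by hand where a real network would learn it. All function names are hypothetical.

```python
import math


def normalise(image):
    """Step 1 (simplified): zero mean, unit variance.
    A stand-in for the PCA whitening described in the text."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    std = math.sqrt(var) or 1.0  # guard against a completely flat image
    return [[(p - mean) / std for p in row] for row in image]


def convolve(image, kernel):
    """Step 2: apply one small filter at every position where it fits."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(len(image[0]) - kw + 1)]
            for r in range(len(image) - kh + 1)]


def relu(fmap):
    """Step 3: non-linearity (ReLU is an assumed choice)."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Step 4: keep the largest value in each non-overlapping block."""
    return [[max(fmap[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


def stage(image, kernel):
    """One pass through all four steps."""
    return max_pool(relu(convolve(normalise(image), kernel)))


# Hand-picked 2x2 filter that responds to dark-to-bright vertical edges.
EDGE = [[-1.0, 1.0],
        [-1.0, 1.0]]

# Toy 7x7 image: three dark columns followed by four bright ones.
image = [[0, 0, 0, 1, 1, 1, 1] for _ in range(7)]

first = stage(image, EDGE)    # 7x7 -> 6x6 after filtering -> 3x3 after pooling
second = stage(first, EDGE)   # 3x3 -> 2x2 after filtering -> 1x1 after pooling
print(len(first), len(first[0]), len(second), len(second[0]))
```

Each pass shrinks the representation while keeping the strongest filter responses, which is why stacking passes lets later layers respond to combinations of the features found by earlier ones.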

Comparing the human visual system to a convolutional net. Both process the world topographically and have a hierarchy of processing layers. From [8]

The topographical mapping of the scene onto the network and the hierarchical structure of convolutional networks closely resemble those of the human visual system. However, the behaviour of artificial neurons is still an extreme simplification of that of biological cells. Further, for practical reasons certain tricks are applied to reduce the number of parameters in the artificial network, for example forcing the set of filters learnt for each world dimension to be constant across the scene through the sharing of information (weights) between neurons, which is unlikely to be true of the human brain.
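A rough back-of-the-envelope count shows why sharing filters matters. The sizes below (a 32x32 colour image, 64 units or filters, 5x5 filters) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def dense_params(height, width, channels, units):
    """No sharing: every unit has its own weight for every input value,
    plus one bias per unit."""
    return (height * width * channels + 1) * units


def conv_params(kernel, channels, filters):
    """With sharing: each filter is one kernel x kernel x channels patch
    reused at every image location, plus one bias per filter."""
    return (kernel * kernel * channels + 1) * filters


dense = dense_params(32, 32, 3, 64)
conv = conv_params(5, 3, 64)
print(dense, conv)
```

Under these assumed sizes the shared-filter layer needs 4,864 parameters against 196,672 without sharing, roughly a forty-fold saving, and the shared count does not grow with the size of the image.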

Over recent years, AIs such as these have started to show really impressive results in object recognition tasks, for example Facebook’s automatic face detection [11] and tagging system or Google’s visual search [12]. Nevertheless, all these applications are identifying images from a large, but ultimately finite, set of image classes, and are unable to extrapolate or contextualise. For example, Facebook’s feature can tag friends but cannot ***yet*** judge their mood in the photo or whether it is a bad photo that the person might ultimately not want tagged. On the other hand, the human brain is able to contextualise and interpret a vast number of different types of visual information, instantly determining what it is seeing and how to respond to that scene.

Current deep learning applications already require massive amounts of computing power. For an AI simply to be able to identify all feasible objects it might see would require more resources than we have available. To then be able to identify objects, assimilate sounds, understand social contexts and respond accordingly is more complex still. The human brain has more than 80 billion neurons, whereas current AIs tune roughly 20 million parameters [13].

Being human is about far more than processing power. It is highly likely that even if we could map every single human neuron, or cryogenically freeze or digitally store a human brain, we would not be able to reanimate that person [15]. Very little is yet understood about the origins of human consciousness, or what it means to be human on a personal or behavioural level [16]. Until we do, human-level AI is just science fiction, but more on all that to come …





[5] Van Essen, David C., and Jack L. Gallant. “Neural mechanisms of form and motion processing in the primate visual system.” Neuron 13.1 (1994): 1-10.

[6] Hansen, Kathleen A., Kendrick N. Kay, and Jack L. Gallant. “Topographic organization in and near human visual area V4.” The Journal of Neuroscience 27.44 (2007): 11896-11911.

[7] LeCun, Yann. “Learning invariant feature hierarchies.” Computer vision–ECCV 2012. Workshops and demonstrations. Springer Berlin Heidelberg, 2012.

[8] Yamins, Daniel LK, and James J. DiCarlo. “Using goal-driven deep learning models to understand sensory cortex.” Nature neuroscience 19.3 (2016): 356-365.





[13] Li, Guan-Nan, Min He, and Xiao-Gang He. “Some predictions of diquark model for hidden charm pentaquark discovered at the LHCb.” arXiv preprint arXiv:1507.08252 (2015).


[15] Koch, Christof, et al. “Neural correlates of consciousness: progress and problems.” Nature Reviews Neuroscience 17.5 (2016): 307-321.
