Perception
Written by Pavol Durisek- Details
-
Category: Articles
-
Published: 19 June 2018
-
Hits: 6457
(very early draft)
Introduction
From 60s to 80s a huge effort was being made to develop a computer vision system. Despite early enthusiasm only one solution was giving some significant and practical results to solve this problem in general, the neural networks which now is just re-branded as deep neural networks. After that, the search for understanding visual perception slowed rapidly down. Although developing artificial intelligence shared similar experience, its development is still active and new ideas seems to occasionally appear.
There are many views and believes about the theory of intelligence and for some time now there have been an effort to build one. We also know that it can be done because we already have one (in a biological world) but we don’t know how to build it or if we are able to build it. It is said that intelligence is a universal tool to solve difficult problems and also that it tries to find an optimal solution for a problem. It is our general view but is it really the case? Its seems to me to be a similar misconception as when we try to test computers on human intelligence by giving them task that are difficult for us, like solving mathematical problems, playing chess or computer games or composing music. We do test humans on those because they are difficult for us. If we want to test how computers are human alike we should test them by giving them tasks that are easy to us, so easy we don't even notice, like task from (visual) perception, yet still unchallenged by today's computer.
Before we deep into the theory of intelligence and perception in particular let's have a thought why they exists and what is they purpose by examining living organisms which posses those.
What is it that all known currently living organism share and all have in common? It is they presence. If we assume that all living things are driven by evolutionary processes then it means that from this point of view (almost) all the properties they have will support this basic one - the ability to survive (either by adapting or changing their environment).
So from now on we can consider intelligence as a tool with one purpose only – to help to survive (in terms of species, not necessarily individuals, evolution would disregard any other purpose). But it is not the only tool nor strategy to do so, all others co-existing on the same time and same environment, are equally good. One could say that e.g. playing the piano is just a random action and nothing to do with survival. Human lives in environment that have not just physical properties, but also a complex social one. So the question rather would be, why complex social interactions emerged and why it is so important for survival? Maybe the answer lays in the work of Danny Hillis on co-evolution1. Perhaps it is a strategy to avoid evolutionary decline of a species.
Now let’s think about how intelligence help to survive. While non-living things are subject to change from the environment, living things can intervene this process and act on this change in order to survive. Basic adaptation does not need a complicated control mechanism, more sophisticated requires more advance techniques, like identification of the situation and an adequate response. While less intelligence creatures could try to survive by outnumbering and randomly modifying itself (so some of them would meet required fitness), more intelligent organism could do the exact thing with much less number.
It appears that intelligence and perception as a part of it is an important tool to go from a more statistical survival to a more intentional survival.
How perception might work? Identifying the situation and pairing with the right action improves survivability. And those are inseparable. From an evolutionary point of view this means that this is not about as realistically and objectively describing the surrounding environment as possible, but extracting and interpreting those information from the input in the way which would help trigger successful actions in terms of survivability for the given organism. This is not independent from the organism and its surrounding. So, to me, intelligence is a successful pairing of a perception to an action. Solving problems are not different. Here, perception is identifying the problem and action means finding a solution for it. Perception is not just understanding the surroundings, but also understanding internal state of the organism, or internal state of the processes involved in intelligent behavior. As we can see perception is one of the key to intelligent behavior.
Also need to mention that back in time when computers were developed John Von Neumann and his team was coping with a problem to build a reliable machine from very unreliable parts, what evolution did was building a powerful and efficient system using very slow and relatively energy consuming parts.
Visual perception
In most biological systems, emitted or reflected light enters trough the lens into the retina where it is detected by photosensitive cells and transformed into neural signals in order to understand surroundings. So, why is this so challenging? One of the problem is the ambiguity between sources of retinal stimulation and the retinal images that are caused by those sources. This problem is often called inverse problem in optics when the exact same image in the retina could be created by an infinite number of object of different size, orientation and distance. We need to also consider combinations of source of light, reflection and absorption of the object and absorption of the environment between the object and the observer. And even a simple fact, that the photo-receptor generates the same neural signal for a number of combination of wavelength (color) and intensity. However, perception does not have to try to be as objective and realistic as possible - that is not its purpose - but trying to be as helpful as possible in the process of survival. The success of perception will be determined by a success of an action that came out as a result of this percept.
Figure 1 The squares marked A and B are the same shade of gray, yet the brain can (correctly) distinguish between them by understanding the whole scene based on past experiences
Another problem is how to effectively extract and interpret meaningful information from the vast amount of data entering the system in a robust way and a reasonable time.
The simplest possible visual information is an image uniformly composed of the same color and brightness. There is no need to send and process data from every photo-receptor separately but just from one of them. Also this information has low value and reliability. When the input image become more complex, what is enough to send to the system are just changes in color and brightness in space (gradients/edges) and time (movements). It is not just to reduce data but also to reduce some ambiguity. While a constant color or intensity significantly change if light conditions changes; changes on gradients remain under more broader change of light conditions (see Figure 1 - squares A and B are exactly the same shade of gray due those light conditions so indistinguishable by computers, edges defining those squares however are clearly detectable). Those individual edge points can be further hierarchically grouped to more complex structures and non relevant data can be further reduced (e.g. changing absolute position to relative one, adding position, scale, rotation invariance, or even more generalizing the object etc.).
In biological systems on and off center cells are used to detect these gradients. In computer vision Difference of Gaussian or Laplacian of Gaussian are commonly used for the same purpose. Do detect gradients from a broad range of magnitude (slope of change, not orientation) usually these filters are applied on each level of the pyramidal representation of an image. To reduce the amount of calculation I would suggest to use LoG filters of the same size, but with different sparsity and increase sparseness just when and where edge is not detected (avoiding computation redundancy). While often is mentioned that Gabor filters more resemble biological visual processing at early stages, they are much more computational costly without significant advantages (orientation of gradients we can get in different computationally more plausible way).
Computer Vision has also a very elegant algorithm called Hough Transform which is relevant to the Radon transform and is used to detect geometric features like straight lines, circles, ellipses that can be described either analytically or in case of Generalized Hough transform, it can detect an arbitrary shape described by its model. The main disadvantage is it speed.
Hough Transform
Let's have a look at a simple line detector. In image space the lines are described by an equation:
, or to also include vertical lines, by equation:
Figure 2.a, 2.b Points P1, P2 and Line L1 in image space (left) and parameter (Hough) space (right)
Every single point in image space is represented by a unique sinusoid in parameter (Hough) space, which represent a set of lines that crosses that point. In other hand, every point in Hough space represents a whole line in image space. A set of two or more points that form a straight line will produce sinusoids which cross at the (ρ, θ) for that line.
In general we can use different equations with different parameters to detect different features. Also adding additional parameter we can achieve e.g. size or orientation invariance but it will cost performance. I believe the same can be achieved also differently without loosing performance.
To avoid repeated calculations, most of Hough transform implementations pre-calculates values of sin(θ) and cos(θ) and reused them from a look-up table.
Randomized Hough Transform
It takes advantage of the fact that some analytical curves can be fully determined by a certain number of points on the curve. For example, a straight line can be determined by two points, and an ellipse (or a circle) can be determined by three points. Input points are randomly processed and when a candidate for an object is found all points representing that object can be eliminated from the input image which further reduces the amount of data processed.
It is worth to note that according to many scientific researches in neurobiology in the early stage of the visual process (in LGN), 90% of visual information is coming from the brain itself and just 10% from the retina as the image is being mostly created in brain rather than received the whole from the eye and just passively processed. And somehow this more resemble this Random Hough Transform algorithm than image processed by deep neural network system which is the most widely used visual recognition method used today. Similar conclusion are formulated not just among neurobiologist but also by AI scientists2.
Discrete Hough Transform
Optimizes calculation which takes into account that both image and Hough space are discrete space.
Randomized Discrete Hough Transform processed symbolically
NARS system is very good at handling symbolic information but not so good to process numerical information. While naturally we thing about image recognition as a set of data processing algorithms, and we think about data as numbers. Actually Gödel suggested that we should look at any mathematical formulas as mathematical symbols had no special meaning, but replaced them by a series of unique numbers, and to prove those formulas we should calculate the whole expressions arithmetically instead of trying to do it symbolically. In this way, having appropriate tools (Turing machine, lambda calculus, computers) we can mechanically prove mathematical formulas. We can do similar thing with our data processing. Why not replace numbers by uniquely assigned symbols which has no meaning on their own. The meaning will be defined by relation with other symbols. And this is where NARS (and similar systems) come in hand. Lets have an image of size 3x3 with a set of (potential edge) points P1, ..., P9 and lets define a Hough Space with θ = { 0, 1 /4π, 1/2π, 3/4π }. By applying equation (2) we get a set of points in Hough space H1,...,H16.
For an illustration, a small subset of Hough transform and inverse Hough transform for the image 3x3 from the example:
- T1: P1 → H3, H6, H9, H12
- T2: P5 → H2, H6, H10, H14
- T3: P9 → H1, H6, H11, H16
- T1-1: H6 → P1, P5, P9
Figure 3.a, 3.b Points P1, P2 and Line L1 in image space (left) and hough space (right)
What we can notice here is that we got rid of the coordinates and the transformation and inverse transformation remove the need to calculate anything but still give us the ability to use the efficiency of the Randomized Hough Transform.
So, H1,...,H16 represents lines in image space. We not just got rid of coordinate system, but the information is compressed without (unintentionally) loosing any details. Moreover, we can go further. As I mentioned earlier, normally when Hough transform is applied in computer vision, to achieve any sort of invariance (position, rotation, scale) additional parameters are added to the parameters space, but hugely compromising performance. We can achieve similar things just by adding additional set of transformations on top of previous one. We can combine terms representing different properties of an object from any image/parameter spaces, so we can be as general or specific as we want to be, and we can do it very effectively. A single term can represent a very general object and if we use it together with a term that has any specific information (location, rotation, scale) that led to the recognition of general term, than we have a term representing a general object but also with some specific properties.
A subset of transformation statements that illustrate how to build from simple features more complex and how to add or remove specificity:
- H1, H2, H3 → H⇒
H1, H2, H3 are specific horizontal lines, whereas H⇒ represent any of them. - H4, H5, H6, H7, H8 → H⇗
Same but with a slope of 45°. - H9, H10, H11 → H⇑
Vertical lines. - H12, H13, H14, H15, H16 → H⇖
Lines with a slope of 135°. - H⇒, H⇑ → H+
Perpendicular lines composed from horizontal and vertical line(s). - H⇗, H⇖ → Hx
Same, but rotated by 45°. - H+, Hx → HL
Perpendicular rotation invariant lines. - P1', P9', H6 → L1
A line segment with an exact location, P1' and P9' are end stop points, that are different from P1 and P9 which are just ordinary edge points. Line does not continue beyond them, so those points need to be specifically detected. Also a line segment without an exact location, with length and/or rotation invariance can be similarly constructed to use it in a more general way e.g. L1, L2... → Linv.
Another example could be to do Hough or similar transformations recursively. E.g. if we do a line Hough transform on a circle centered at the beginning of a coordinate system, we get a set of sinusoids in parameter space enveloped by two distinct horizontal lines ρ and -ρ. Applying the same transformation again, we get two points that represent that envelop in the previous parameter space, which represent a circle in image space. What this all resemble is how simple, complex and hyper-complex cells hierarchically represents objects as its been described in many research studies of neurobiologist. In those examples we used equation (2) from a line Hough transform, but in general we can use any equation, or even transformations that doesn't have any analytical form.
Because of discretization of the image space some question still need to be solved. How to effectively apply transformations on image space(s)? Using small partially overlapped regions, or using a global log polar coordinates system, a combination of those or something entirely different? In Cartesian coordinate system we highly compromising effectiveness and resolution
Now, one more question still remains, how to effectively implement all of this?
An efficient system running Symbolic Hough Transform
While NARS is very well designed to handle such a task, in my opinion, currently other system might easily outperform it in terms of speed. To try out the algorithm, I decided to create a new simplistic system, and mainly target performance, So I took NARS and tried to simplify to the extent just to be able to handle this single task and create a control mechanism that would target performance. Information will be still represented in a similar but simplistic way to NARS. Here nodes will represent terms and edges relations between terms. Also, ability to learn will remain, but might be different from the original NARS design. The control mechanism is inspired by Petri net. Petri net is a type of Discrete event dynamic system, which is a tool to model various concurrency and synchronization problems in distributed and parallel processing systems. Also some features of the system will be inspired by biological neural systems although those will not try to be accurate but its usage will be justified by its helpfulness in this particular context. It has some resemblance with neural networks, but the network is dynamic (new connections are added or removed on the fly) and its calculations are more efficient (it is not preformed in a global level, but just those parts where it is necessary). All in all – whatever works, no matter how bizarre it seems to be, is equally good (in my opinion) if we get the expected results.
Event Network
(to be refined)
Figure 4 Basic elements of the event network
Let's have a system with a set of nodes and edges, where nodes represent terms and edges relations between those terms. A node can be activated in a similar way than in Petri Nets, but there will be differences. Markings (in Petri Nets "tokens", here "potential" ui) will be held by edges, not nodes (places in Petri Nets) and they will be a real number from < -1., 1.> and will be a subject to a time degradation. There will be no special elements like transitions, here the firing will be directly on nodes and the process of firing will be called event which will have the same name as the node itself. The event will trigger computation - on every outgoing connections, ui will be updated, and checked whether it activates a node. When a node fires, it produces a potential with a value 1.0 to all its outgoing connections, but also looses its potential in all incoming connections (except on nodes with graded potential). The value of potential of all outgoing connection for the node that has been fired will be updated. The real value of the tokens on the connection will be 1.0 * wi added to existing value degraded by time (here Δt means time interval since last update of ui), and the sum of weight for all connection for a particular node will be also 1.0 or -1.0 (for wi < 0, to be able to block activation). The weights are not constant but changing depending on its usage. Also new connection can be made or weak one deleted. New connection will be made in a bit similar fashion than it is in NARS, although the rules will be not so strict. In neurobiology there is a theory that says "what fires together wires together.", so new connection will be made based on time and space proximity of nodes. Time proximity means, if events, representing firing nodes will often fire together (within small interval Δt) and they are spatially close – a connection will be made (weight of other connection related those nodes will be adjusted to keep sum of weight = 1.0). Firing of a node will occur when the sum of markings multiplied by the weights will be higher than a certain global) threshold. This marking can be also summed up during time, when multiple firing occurs on the same connection but the node, where this connection is entering, was not activated yet.. Some of input connection can be negations to block certain events and prevent further processing.
Action potential vs Graded potential: Action potential is a source of single event. Graded potential is a periodical generator of events, which frequency depends on the value of a grade – usually used for sensors.
Figure 5 A very simplified version of our line detector using Event Network emulating Randomized Hough Transform
Overall architecture of Visual Perception using Event Network
(to be refined)
Figure 6 Overall architecture of visual processing
Comparison to other systems
(to be refined)
Deep Neural networks/Convolutional Neural Networks, SIFT and similar, Classical Hough Although neural networks are widely used for image categorization, they have a lots of problems:
- Computes always everything (NN)
- loosing information (CNN)
- no relationship between detected local features (most of the techniques or algorithm)
- does not understand context (NN)
- after design, the structure remains fixed, after training weights remain fixed (NN)
Possible applications
(to be refined)
This system has a potential to be used not just for Visual perception or other kind of perceptions, perhaps NLP but also as a part of a universal problem solver. Mapping a problem to a different parameter space with a combination of effective techniques to find required solution with the ability to learn could be a powerful tool for various real world applications. Moreover it is simple enough to be easily implemented.
(to be extended)
1 Ramps-Antiramps and the red queen - An early genetic algorithm (From Charles Ofria )
2 Perception from an AGI Perspective (By Pei Wang and Patrick Hammer)
[Proceedings of AGI-18, Prague, Czech, August 2018]
Perception in AGI should be subjective, active, and unified with other cognitive processes