Compression of visual data into symbol-like descriptors in terms of a cognitive real-time vision system
Die Verdichtung der Videoeingabe in symbolische Deskriptoren im Rahmen des kognitiven Echtzeitvisionsystems
von Alexey Abramov
Date of oral examination: 2012-07-18
Published: 2012-10-16
Supervisor: Prof. Dr. Florentin Wörgötter
Referee: Prof. Dr. Florentin Wörgötter
Referee: Prof. Dr. Winfried Kurth
Files
Name: abramov.pdf
Size: 33.4 MB
Format: PDF
Abstract
English
Humans have five main senses: sight, hearing, touch, smell, and taste. Most of them combine several aspects; vision, for example, addresses at least three perceptual modalities: motion, color, and luminance. Extraction of these modalities begins in the human eye, in the retinal network, and the preprocessed signals enter the brain as streams of spatio-temporal patterns. As vision is our main sense, particularly for the perception of the three-dimensional structure of the world around us, major efforts have been made to understand and simulate the visual system based on the knowledge collected to date. The research done over the last decades in the fields of image processing and computer vision, coupled with a tremendous step forward in hardware for parallel computing, has opened the door to building so-called cognitive vision systems and to incorporating them into robots. The goal of any cognitive vision system is to transform the visual input into representations more descriptive than just color, motion, or luminance. Furthermore, most robotic systems require "live" interactions of the robot with its environment, which greatly increases the demands on the system. In such systems, all pre-computations of the visual data need to be performed in real-time in order to use the output in the perception-action loop. Thus, a central goal of this thesis is to provide techniques that are strictly compatible with real-time computation. In the first part of this thesis we investigate possibilities for powerful compression of the initial visual input into symbol-like descriptors, upon which abstract logic or learning schemes can be applied. We introduce a new real-time video segmentation framework that automatically decomposes monocular and stereo video streams without prior knowledge about the data, considering only preceding information. All entities in the scene, representing objects or their parts, are uniquely identified.
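The idea of segmenting a video stream using only preceding information can be sketched as follows. This is a minimal illustration under assumed interfaces, not the segmentation kernel developed in the thesis; `propagate_labels` and its intensity tolerance `tol` are hypothetical:

```python
import numpy as np

def propagate_labels(prev_frame, prev_labels, curr_frame, tol=20.0):
    """Use the previous frame's segment labels to initialize the
    current frame, so entities keep their identities over time.
    Pixels whose intensity changed by more than `tol` receive a
    fresh label, crudely modeling a new entity entering the scene."""
    diff = np.abs(curr_frame.astype(float) - prev_frame.astype(float))
    labels = prev_labels.copy()
    labels[diff > tol] = prev_labels.max() + 1  # new segment id
    return labels

prev = np.zeros((4, 4))          # previous gray-scale frame
labels = np.ones((4, 4), int)    # one segment covers the whole frame
curr = prev.copy()
curr[0, 0] = 100                 # a bright pixel appears
out = propagate_labels(prev, labels, curr)
```

A full framework would refine these initial labels iteratively rather than thresholding raw intensity differences, but the principle of carrying segment identities forward from frame to frame is the same.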
In the second part of the thesis we make additional use of stereoscopic visual information and address the problem of establishing correspondences between two views of a scene, a problem solved with apparent ease by the human visual system (for images acquired with the left and right eye). We exploit these correspondences in stereo image pairs to estimate depth (distance), proposing a novel disparity measurement technique based on extracted stereo-segments. This technique approximates shape and computes depth information for all entities found in the scene. The most important and novel achievement of this approach is that it produces reliable depth information for objects with weak texture, where the performance of traditional stereo techniques is very poor. In the third part of this thesis we employ an active sensor, which indoors produces much more precise depth information, encoded as range data, than any passive stereo technique. We fuse image and range data for video segmentation, which yields better segmentations; we can now even handle fast-moving objects, which was not possible before. To meet the real-time constraint, the proposed segmentation framework was accelerated on a Graphics Processing Unit (GPU) architecture using the parallel programming model of the Compute Unified Device Architecture (CUDA). All introduced methods (segmentation of single images, segmentation of monocular and stereo video streams, depth-supported video segmentation, and disparity computation from stereo-segment correspondences) run in real-time for medium-sized images and close to real-time for higher resolutions. In summary, the main result of this thesis is a framework that produces a compact representation of any visual scene in which all meaningful entities are uniquely identified and tracked, and important descriptors, such as shape and depth information, are extracted.
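The step from a segment's disparity to its depth rests on standard pinhole stereo triangulation, Z = f * B / d. The sketch below illustrates only this geometric relation; `segment_depth` is a hypothetical helper, not the thesis implementation:

```python
def segment_depth(focal_px, baseline_m, mean_disparity_px):
    """Triangulate metric depth for one stereo-segment from its mean
    disparity: Z = f * B / d, with focal length f in pixels, stereo
    baseline B in meters, and disparity d in pixels."""
    if mean_disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / mean_disparity_px

# A segment with a mean disparity of 35 px, seen by a rig with
# f = 700 px and B = 0.1 m, lies 2.0 m from the camera.
depth = segment_depth(700.0, 0.1, 35.0)
```

Averaging disparity over a whole segment is what makes such an approach robust on weakly textured objects: a per-pixel matcher finds no reliable correspondences there, while a segment-level estimate aggregates evidence over the entire region.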
The capabilities of the framework were successfully demonstrated in the context of several European projects (PACO-PLUS, Garnics, IntellAct, and Xperience). The developed real-time system is now employed as a robust visual front-end in various real-time robotic systems.
Keywords: Image processing; Computer vision; Video segmentation; Stereo vision
Other languages
Sight, hearing, touch, smell, and taste are among the most important human senses. They combine various aspects: vision, for example, involves at least three perceptual modalities: motion, color, and intensity. The extraction of these modalities begins in the human eye, in the retinal network, and the resulting signals enter the brain as streams of spatio-temporal patterns. Vision is our most important sense for perceiving three-dimensional structures in the world around us. To date, countless attempts have been made to understand and simulate the knowledge gathered so far within an artificial vision system. Research results from recent decades in the fields of digital image processing and machine vision, combined with advances in hardware for parallel processing, enable the construction of so-called cognitive vision systems and their application in robots. The goal of a cognitive vision system is to transform the visual input into a more descriptive representation. Moreover, most robots require "live" interactions with their environment.
Keywords: Image processing; Machine vision; Video segmentation; Stereo vision