Compression of visual data into symbol-like descriptors in terms of a cognitive real-time vision system
Die Verdichtung der Videoeingabe in symbolische Deskriptoren im Rahmen des kognitiven Echtzeitvisionsystems
von Alexey Abramov
Date of oral examination: 2012-07-18
Published: 2012-10-16
Supervisor: Prof. Dr. Florentin Wörgötter
Referee: Prof. Dr. Florentin Wörgötter
Referee: Prof. Dr. Winfried Kurth
Files
Name: abramov.pdf
Size: 33.4 MB
Format: PDF
Abstract
English
Humans have five main senses: sight, hearing, touch, smell, and taste. Most of them combine several aspects; vision, for example, addresses at least three perceptual modalities: motion, color, and luminance. Extraction of these modalities begins in the human eye, in the retinal network, and the preprocessed signals enter the brain as streams of spatio-temporal patterns. As vision is our main sense, particularly for the perception of the three-dimensional structure of the world around us, major efforts have been made to understand and simulate the visual system based on the knowledge collected to date. The research done over the last decades in the fields of image processing and computer vision, coupled with a tremendous step forward in hardware for parallel computing, has opened the door to building so-called cognitive vision systems and to incorporating them into robots. The goal of any cognitive vision system is to transform the visual input into representations more descriptive than just color, motion, or luminance. Furthermore, most robotic systems require "live" interactions of the robot with its environment, which greatly increases the demands on the system. In such systems, all pre-computations of the visual data need to be performed in real-time in order to use the output in the perception-action loop. Thus, a central goal of this thesis is to provide techniques that are strictly compatible with real-time computation. In the first part of this thesis we investigate possibilities for powerful compression of the initial visual input into symbol-like descriptors, upon which abstract logic or learning schemes can be applied. We introduce a new real-time video segmentation framework that automatically decomposes monocular and stereo video streams without prior knowledge about the data, considering only preceding information. All entities in the scene, representing objects or their parts, are uniquely identified.
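The idea of segmenting a video stream using only preceding information can be sketched as follows. This is a minimal illustration under assumed interfaces, not the segmentation kernel developed in the thesis; `propagate_labels` and its intensity tolerance `tol` are hypothetical:

```python
import numpy as np

def propagate_labels(prev_frame, prev_labels, curr_frame, tol=20.0):
    """Use the previous frame's segment labels to initialize the
    current frame, so entities keep their identities over time.
    Pixels whose intensity changed by more than `tol` receive a
    fresh label, crudely modeling a new entity entering the scene."""
    diff = np.abs(curr_frame.astype(float) - prev_frame.astype(float))
    labels = prev_labels.copy()
    labels[diff > tol] = prev_labels.max() + 1  # new segment id
    return labels

prev = np.zeros((4, 4))          # previous gray-scale frame
labels = np.ones((4, 4), int)    # one segment covers the whole frame
curr = prev.copy()
curr[0, 0] = 100                 # a bright pixel appears
out = propagate_labels(prev, labels, curr)
```

A full framework would refine these initial labels iteratively rather than thresholding raw intensity differences, but the principle of carrying segment identities forward from frame to frame is the same.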
In the second part of the thesis we make additional use of stereoscopic visual information and address the problem of establishing correspondences between two views of a scene, a problem solved with apparent ease by the human visual system (for images acquired with the left and right eye). We exploit these correspondences in stereo image pairs to estimate depth (distance), proposing a novel disparity measurement technique based on extracted stereo-segments. This technique approximates shape and computes depth information for all entities found in the scene. The most important and novel achievement of this approach is that it produces reliable depth information for objects with weak texture, where the performance of traditional stereo techniques is very poor. In the third part of this thesis we employ an active sensor, which indoors produces much more precise depth information, encoded as range data, than any passive stereo technique. We fuse image and range data for video segmentation, which yields better segmentations; we can now even handle fast-moving objects, which was not possible before. To meet the real-time constraint, the proposed segmentation framework was accelerated on a Graphics Processing Unit (GPU) architecture using the parallel programming model of the Compute Unified Device Architecture (CUDA). All introduced methods (segmentation of single images, segmentation of monocular and stereo video streams, depth-supported video segmentation, and disparity computation from stereo-segment correspondences) run in real-time for medium-sized images and close to real-time for higher resolutions. In summary, the main result of this thesis is a framework that produces a compact representation of any visual scene in which all meaningful entities are uniquely identified and tracked, and important descriptors, such as shape and depth information, are extracted.
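The step from a segment's disparity to its depth rests on standard pinhole stereo triangulation, Z = f * B / d. The sketch below illustrates only this geometric relation; `segment_depth` is a hypothetical helper, not the thesis implementation:

```python
def segment_depth(focal_px, baseline_m, mean_disparity_px):
    """Triangulate metric depth for one stereo-segment from its mean
    disparity: Z = f * B / d, with focal length f in pixels, stereo
    baseline B in meters, and disparity d in pixels."""
    if mean_disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / mean_disparity_px

# A segment with a mean disparity of 35 px, seen by a rig with
# f = 700 px and B = 0.1 m, lies 2.0 m from the camera.
depth = segment_depth(700.0, 0.1, 35.0)
```

Averaging disparity over a whole segment is what makes such an approach robust on weakly textured objects: a per-pixel matcher finds no reliable correspondences there, while a segment-level estimate aggregates evidence over the entire region.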
The capabilities of the framework were successfully demonstrated in the context of several European projects (PACO-PLUS, Garnics, IntellAct, and Xperience). The developed real-time system is now employed as a robust visual front-end in various real-time robotic systems.
Keywords: Image processing; Computer vision; Video segmentation; Stereo vision
Other languages
Sight, hearing, touch, smell, and taste are among the most important human senses. They combine various aspects: vision, for example, involves at least three perceptual modalities: motion, color, and intensity. The extraction of these modalities begins in the human eye, in the retinal network, and the resulting signals enter the brain as streams of spatio-temporal patterns. Vision is our most important sense for perceiving three-dimensional structures in the world around us. To date, countless attempts have been made to understand and simulate the knowledge gathered so far within an artificial vision system. Research results from recent decades in the fields of digital image processing and machine vision, combined with advances in hardware for parallel processing, enable the construction of so-called cognitive vision systems and their application in robots. The goal of a cognitive vision system is to transform the visual input into a more descriptive representation. Moreover, most robots require "live" interactions with their environment.
Keywords: Image processing; Machine vision; Video segmentation; Stereo vision