Humans are able to perceive their surroundings apparently with ease. Without much
thought we process the complex visual stream into meaningful entities which we call
objects. How we do this remains an open question that years of research have already addressed.
Still, there is a general consensus that so-called Visual Object Perception is one
of the most fundamental abilities that intelligent agents rely on to make sense of their environment.
In this thesis we advocate the idea that Visual Object Perception can be decomposed into
three concurrent ways of perceiving objects: Instance, category, and function perception. This
decomposition emanates from the idea that perception is inseparably intertwined with actions
and tasks. If an action requires a specific object (e.g., pour this tea into my teddy-bear cup), one
starts perceiving available objects at the instance level. If the task asks for a generic cup (e.g.,
go to the supermarket and buy some cups), agents need to perceive objects at the category
level, without caring for the exact instances. Finally, the function level is used when objects are
defined by the task itself instead of a specific category name. For example, transport water from
A to B (1) or bore a hole into the soil for seeding plants (2). Both tasks define objects by the
role they play in the action context, i.e., a fillable object (1) and an object to poke or bore into
the soil (2), respectively.
Mastering function-level perception in particular was a step in our cognitive evolution
that enabled early hominids, at the advent of humankind, to make sense of their environment
and to use objects as tools. Eventually, this allowed us to build ever better tools driven by
human ingenuity, which separates us from all other animals.
In order to make a machine interact with objects in a “human-like” way, we see two questions
which need to be addressed: first, which objects do I see, and second, how can I manipulate
or use these objects? The former requires label assignment (e.g., classification, recognition);
the latter requires estimating the orientation and location (pose) of recognized objects in order to correctly apply motor behavior when using them. Depending on the required perception
level (i.e., instance, category, or function), both problems need to be treated with different
approaches. Consequently, there is a total of six sub-problems (two problems × three perception levels):
Instance Recognition, Object Categorization, and Object Function Assignment; Pose
Estimation of instances, Pose Estimation at the category level, and Pose Estimation at the
function level. In this thesis we contribute to Instance Recognition, Object Categorization,
Object Function Assignment, and Pose Estimation at the category level. Although not yet
published at the time this thesis was submitted, we also discuss a small preliminary study on
Pose Estimation at the function level at the end of this thesis.
For Instance Recognition, all objects in the environment are uniquely defined and need to
be discriminated. This requires all objects to be recorded and learned before a system is able
to recognize them. As a consequence, agents are limited to specific environments; moving a machine
to a new environment would require a new training set and renewed training. To solve
this problem, we present a method which is able to automatically record a training set from a
scene with minimal human supervision. Moreover, to deal with highly visually similar objects
(e.g., two similar-looking cups) we develop an algorithm which is highly discriminative while
remaining robust to illumination, scale, and object rotation.
At the category level we treat Object Categorization as well as Pose Estimation. As rich
models like Deep Convolutional Neural Networks have become the de facto standard in modern
Object Categorization systems, huge amounts of relevant training data are required. Our first
contribution, the TransClean algorithm, is able to generate such large sets of relevant training
images for categories while also dealing with ambiguous category names. For example, the categories
apple, nut, and washer are ambiguous (polysemes) because they can refer to different
objects: apple refers to a notebook brand or the fruit; nut to the hardware or the fruit; washer to the
hardware or a washing machine. The general idea is that this ambiguity usually does not exist
in other languages. For example, washer translates to the German words “Waschmaschine”
(the washing machine) and “Unterlegscheibe” (the hardware), so the ambiguity does not exist
there. TransClean uses this idea to sort out irrelevant images retrieved from online word-tagged
image databases (e.g., Google Image Search) by comparing the images retrieved for different languages. The second contribution treats the challenging task of Pose Estimation at the
category level. This is difficult because the system cannot align stored models to known
instances recorded in the scene (as is done for Pose Estimation of instances). We address this
by introducing a Deep Convolutional Neural Network which predicts not only the category
but also the category-level pose of objects. The need for a large set of annotated training data is met
by synthesizing cluttered indoor scenes.
Lastly, at the function level, objects are treated not as a whole but as an
aggregation of parts in specific constellations. First, we present three sequential algorithms for
segmenting a scene into objects and objects into their parts. Second, we develop a framework
which analyses the parts and part constellations to learn the function of each part (e.g., being
a blade or a tip) together with the function of the object as a whole (e.g., being something
for cutting or drilling). Interestingly, objects and their parts can possess multiple functions. For
example, a hammer-like object can be used to hit a nail, or it can serve as a makeshift replacement
for task (2) defined earlier (bore a hole into the soil for seeding plants), now using the
handle as the tool-end.
All the work presented in this thesis has been systematically evaluated using existing or new
benchmarks and proved superior to the state of the art in the respective tasks.
The comprehensive treatment of Artificial Visual Object Perception which we introduce in
this thesis has widespread application in various scenarios, including robots in human healthcare,
household robots, and robots for emergency response (e.g., in disaster zones). For example,
it allows for new problem-solving strategies in agents. Instead of looking for a predefined
and hard-coded object which solves a task, agents can perceive objects at, for example, the
function level and propose creative solutions: use a hammer to bore a hole into soil or to push a
button which is out of reach; use a boot or a helmet to transport water.