

dc.contributor.advisor Wörgötter, Florentin Prof. Dr.
dc.contributor.author Schoeler, Markus
dc.date.accessioned 2015-11-02T09:19:10Z
dc.date.available 2015-11-02T09:19:10Z
dc.date.issued 2015-11-02
dc.identifier.uri http://hdl.handle.net/11858/00-1735-0000-0023-9669-A
dc.language.iso eng de
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject.ddc 510 de
dc.title Visual Perception of Objects and their Parts in Artificial Systems de
dc.type doctoralThesis de
dc.contributor.referee Wörgötter, Florentin Prof. Dr.
dc.date.examination 2015-10-12
dc.description.abstracteng Humans perceive their surroundings with apparent ease. Without much thought, we process the complex visual stream into meaningful entities which we call objects. How we do this remains an open question despite years of research. Still, there is a general consensus that so-called Visual Object Perception is one of the most fundamental abilities intelligent agents use to make sense of their environment. In this thesis we advocate the idea that Visual Object Perception can be decomposed into three concurrent ways of perceiving objects: instance, category, and function perception. This decomposition stems from the idea that perception is inseparably intertwined with actions and tasks. If an action requires a specific object (e.g., pour this tea into my teddy-bear cup), one starts perceiving the available objects at the instance level. If the task asks for a generic cup (e.g., go to the supermarket and buy some cups), agents need to perceive objects at the category level, without regard to the exact instances. Finally, the function level is used when objects are defined by the task itself instead of by a specific category name. Examples are: transport water from A to B (1), or bore a hole into the soil for seeding plants (2). Both tasks define objects by the role they play in the action context, i.e., a fillable object (1) and an object to poke or bore into the soil (2), respectively. Mastering function-level perception in particular was a step in our cognitive evolution that enabled early hominids, at the advent of humankind, to make sense of their environment and use objects as tools. Eventually, this allowed us to build better tools, driven by the human ingenuity which separates us from all other animals. To make a machine interact with objects in a “human-like” way, we see two questions which need to be addressed: first, what objects do I see and, second, how can I manipulate or use these objects?
The former requires label assignment (e.g., classification, recognition); the latter requires estimating the orientation and location (pose) of recognized objects in order to correctly apply motor behavior to use them. Depending on the required perception level (i.e., instance, category, or function), both problems need to be treated with different approaches. Consequently, there is a total of 6 sub-problems (2 problems × 3 perception levels): Instance Recognition, Object Categorization, and Object Function Assignment; Pose Estimation of instances, Pose Estimation at the category level, and Pose Estimation at the function level. In this thesis we contribute to Instance Recognition, Object Categorization, Object Function Assignment, and Pose Estimation at the category level. At the end of the thesis we also discuss a small preliminary study on Pose Estimation at the function level, which had not been published at the time of submission. For Instance Recognition, all objects in the environment are uniquely defined and need to be discriminated. This requires all objects to be recorded and learned before a system is able to recognize them. As a consequence, agents are limited to specific environments; moving a machine to a new environment would require a new training set and further training. To solve this problem, we present a method which automatically records a training set from a scene with minimal human supervision. Moreover, to deal with visually very similar objects (e.g., two similar-looking cups), we develop an algorithm which is highly discriminative while being robust to illumination, scale, and object rotation. At the category level we treat Object Categorization as well as Pose Estimation. As rich models like Deep Convolutional Neural Networks have become the de facto standard in modern Object Categorization systems, huge amounts of relevant training data are required.
Our first contribution, the TransClean algorithm, generates such large sets of relevant training images for categories, while also dealing with ambiguous category names. For example, the categories apple, nut, and washer are ambiguous (polysemes), because each can refer to different objects: apple refers to a notebook or the fruit; nut to the hardware or the fruit; washer to the hardware or a washing machine. The general idea is that this ambiguity usually does not exist in other languages. For example, washer translates to the German words “Waschmaschine” (the washing machine) and “Unterlegscheibe” (the hardware), so the ambiguity does not exist there. TransClean uses this idea to sort out irrelevant images retrieved from online word-tagged image databases (e.g., Google Image-Search) by comparing the images retrieved for different languages. The second contribution addresses the challenging task of Pose Estimation at the category level. This is complicated because the system cannot align stored models to known instances recorded in the scene (which is what is done for Pose Estimation of instances). We address this by introducing a Deep Convolutional Neural Network which predicts not only the category but also the category pose of objects. The need for a large set of annotated training data is met by synthesizing cluttered indoor scenes. Lastly, the function level is determined by treating objects not as a whole but as an aggregation of parts in specific constellations. First, we present three sequential algorithms for segmenting a scene into objects and objects into their parts. Second, we develop a framework which analyses the parts and part constellations to learn the function of each part (e.g., being a blade or a tip) together with the function of the object as a whole (e.g., being something for cutting or drilling). Interestingly, objects and their parts can possess multiple functions.
For example, a hammer-like object can be used to hit a nail, or it can serve as a makeshift tool for task (2) defined earlier (bore a hole into the soil for seeding plants), now using the handle as the tool end. All the work presented in this thesis has been systematically evaluated using existing or new benchmarks and outperformed the state of the art in the respective tasks. The comprehensive treatment of Artificial Visual Object Perception which we introduce in this thesis has widespread applications in various scenarios, including robots in human healthcare, household robots, and robots for emergency response (e.g., in disaster zones). For example, it allows for new problem-solving strategies in agents. Instead of looking for a predefined and hard-coded object which solves a task, agents can perceive objects at, for example, the function level and propose creative solutions: use a hammer to bore a hole into soil or to push a button which is out of reach; use a boot or a helmet to transport water. de
dc.contributor.coReferee Guerin, Frank Dr.
dc.contributor.thirdReferee Krüger, Norbert Prof. Dr.
dc.subject.eng Object Recognition de
dc.subject.eng Object Segmentation de
dc.subject.eng Object Partitioning de
dc.subject.eng Function Recognition de
dc.subject.eng Pose Estimation de
dc.subject.eng Scene Segmentation de
dc.subject.eng Object Categorization de
dc.subject.eng Object Classification de
dc.identifier.urn urn:nbn:de:gbv:7-11858/00-1735-0000-0023-9669-A-0
dc.affiliation.institute Fakultät für Mathematik und Informatik de
dc.subject.gokfull Informatik (PPN619939052) de
dc.identifier.ppn 838214665
