Humans are able to perceive their surroundings apparently with ease. Without much
thought we process the complex visual stream into meaningful entities which we call
objects. How we do this remains an open question that years of research have already addressed.
Still, there is a general consensus that so-called Visual Object Perception is one
of the most fundamental abilities that intelligent agents rely on to make sense of their environment.
In this thesis we advocate the idea that Visual Object Perception can be decomposed into
three concurrent ways of perceiving objects: Instance, category, and function perception. This
decomposition emanates from the idea that perception is inseparably intertwined with actions
and tasks. If an action requires a specific object (e.g., pour this tea into my teddy-bear cup), one
starts perceiving available objects at the instance level. If the task asks for a generic cup (e.g.,
go to the supermarket and buy some cups), agents need to perceive objects at the category
level, without caring for the exact instances. Finally, the function level is used when objects are
defined by the task itself instead of a specific category name. For example, transport water from
A to B (1) or bore a hole into the soil for seeding plants (2). Both tasks define objects by the
role they play in the action context, i.e., a fillable object (1) and an object to poke or bore into
the soil (2), respectively.
Mastering function-level perception in particular was a step in our cognitive evolution
that enabled early hominids, at the advent of humankind, to make sense of their environment
and to use objects as tools. Eventually, this allowed us to build ever better tools driven by
human ingenuity, which separates us from all other animals.
In order to make a machine interact with objects in a “human-like” way, we see two questions
which need to be addressed: first, which objects do I see, and second, how can I manipulate
or use these objects? The former requires label assignment (e.g., classification, recognition);
the latter requires estimating the orientation and location (pose) of recognized objects in order to correctly apply motor behavior when using them. Depending on the required perception
level (i.e., instance, category, or function), both problems need to be treated with different
approaches. Consequently, there is a total of six sub-problems (two problems × three perception levels):
Instance Recognition, Object Categorization, and Object Function Assignment; Pose
Estimation of instances, Pose Estimation at the category level, and Pose Estimation at the
function level. In this thesis we contribute to Instance Recognition, Object Categorization,
Object Function Assignment, and Pose Estimation at the category level. Although not yet
published at the time this thesis was submitted, we also discuss a small preliminary study on
Pose Estimation at the function level at the end of this thesis.
For Instance Recognition, all objects in the environment are uniquely defined and need to
be discriminated. This requires all objects to be recorded and learned before a system is able
to recognize them. As a consequence, agents are limited to specific environments; moving a machine
to a new environment would require a new training set and renewed training. To solve
this problem, we present a method which is able to automatically record a training set from a
scene with minimal human supervision. Moreover, to deal with highly visually similar objects
(e.g., two similar-looking cups) we develop an algorithm which is highly discriminative while
remaining robust to illumination, scale, and object rotation.
At the category level we treat Object Categorization as well as Pose Estimation. As rich
models like Deep Convolutional Neural Networks have become the de facto standard in modern
Object Categorization systems, huge amounts of relevant training data are required. Our first
contribution, the TransClean algorithm, is able to generate such large sets of relevant training
images for categories while also dealing with ambiguous category names. For example, the categories
apple, nut, and washer are ambiguous (polysemes) because they can refer to different
objects: apple refers to a notebook brand or the fruit; nut to the hardware or the fruit; washer to the
hardware or a washing machine. The general idea is that this ambiguity usually does not exist
in other languages. For example, washer translates to the German words “Waschmaschine”
(the washing machine) and “Unterlegscheibe” (the hardware), so the ambiguity does not exist
there. TransClean uses this idea to sort out irrelevant images retrieved from online word-tagged
image databases (e.g., Google Image Search) by comparing the images retrieved for different languages. The second contribution treats the challenging task of Pose Estimation at the
category level. This is difficult because the system cannot align stored models to known
instances recorded in the scene (as is done for Pose Estimation of instances). We address this
by introducing a Deep Convolutional Neural Network which predicts not only the category
but also the category-level pose of objects. The need for a large set of annotated training data is met
by synthesizing cluttered indoor scenes.
Lastly, at the function level, objects are treated not as a whole but as an
aggregation of parts in specific constellations. First, we present three sequential algorithms for
segmenting a scene into objects and objects into their parts. Second, we develop a framework
which analyses the parts and part constellations to learn the function of each part (e.g., being
a blade or a tip) together with the function of the object as a whole (e.g., being something
for cutting or drilling). Interestingly, objects and their parts can possess multiple functions. For
example, a hammer-like object can be used to hit a nail, or it can serve as a makeshift replacement
for task (2) defined earlier (bore a hole into the soil for seeding plants), now using the
handle as the tool-end.
All the work presented in this thesis has been systematically evaluated using existing or new
benchmarks and proved superior to the state of the art in the respective tasks.
The comprehensive treatment of Artificial Visual Object Perception which we introduce in
this thesis has widespread application in various scenarios, including robots in human healthcare,
household robots, and robots for emergency response (e.g., in disaster zones). For example,
it allows for new problem-solving strategies in agents. Instead of looking for a predefined
and hard-coded object which solves a task, agents can perceive objects at, for example, the
function level and propose creative solutions: use a hammer to bore a hole into soil or to push a
button which is out of reach; use a boot or a helmet to transport water.