Spatio-temporal reasoning for semantic scene understanding and its application in recognition and prediction of manipulation actions in image sequences
by Fatemeh Ziaeetabar
Date of Examination: 2019-05-07
Date of issue: 2020-04-21
Advisor: Prof. Dr. Florentin Wörgötter
Referee: Prof. Dr. Florentin Wörgötter
Referee: Prof. Dr. Ricarda I. Schubotz
Referee: Prof. Dr. Dieter Hogrefe
Referee: Prof. Dr. Marcus Baum
Referee: Prof. Dr. Carsten Damm
Referee: Prof. Dr. Wolfgang May
Human activity understanding has attracted much attention in recent years due to its key role in a wide range of applications and devices, such as human-computer interfaces, visual surveillance, video indexing, intelligent humanoid robots, and ambient intelligence. Of particular relevance is the performance of manipulation actions, which is especially important for service as well as industrial robots; such robots benefit strongly from fast and predictive recognition of manipulation actions. Although performing these actions is quite trivial for us as humans, this is not necessarily the case for a robot. To address this problem, in this thesis we propose a representation, as well as algorithms for the recognition and prediction, of manipulation action categories as observed in videos. The key contributions of this thesis are the following: First, we model each object as a simple axis-aligned bounding box and provide a qualitative spatial reasoning method to calculate static and dynamic spatial relations accordingly. Static relations depend on the relative spatial position of two objects and include ``Above'', ``Below'', ``Right'', ``Left'', ``Front'', ``Back'', ``Inside'', ``Surround'', ``Around without touch'', ``Around with touch'', ``Top'' and ``Bottom''; dynamic relations describe the spatial relation of two objects while either or both of them move, and consist of ``Getting close'', ``Moving apart'', ``Stable'', ``Moving together'', ``Halting together'' and ``Fixed moving together''. This qualitative approach allows us to provide a new semantic representation of a manipulation action as a sequence of static and dynamic spatial relations between the objects taking part in the manipulation. Our approach encodes this sequence in a transition matrix called the ``Enriched Semantic Event Chain (ESEC)''.
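As an illustration, classifying a static relation between two axis-aligned bounding boxes could proceed along the following lines. This is a minimal sketch covering only a subset of the twelve relations listed above; the class and function names, the axis conventions, and the fallback case are our own assumptions, not taken from the thesis:

```python
from dataclasses import dataclass


@dataclass
class AABB:
    # Axis-aligned bounding box given by its min/max corners in 3-D.
    xmin: float
    xmax: float
    ymin: float
    ymax: float
    zmin: float
    zmax: float


def static_relation(a: AABB, b: AABB) -> str:
    """Classify a few qualitative static relations of object a
    with respect to object b (illustrative subset only)."""
    # "Inside": a's box is entirely contained in b's box.
    if (a.xmin >= b.xmin and a.xmax <= b.xmax and
            a.ymin >= b.ymin and a.ymax <= b.ymax and
            a.zmin >= b.zmin and a.zmax <= b.zmax):
        return "Inside"
    # "Above"/"Below": boxes are disjoint along the vertical (z) axis.
    if a.zmin >= b.zmax:
        return "Above"
    if a.zmax <= b.zmin:
        return "Below"
    # "Left"/"Right": boxes are disjoint along the x axis.
    if a.xmax <= b.xmin:
        return "Left"
    if a.xmin >= b.xmax:
        return "Right"
    # Coarse fallback for overlapping boxes (assumed label).
    return "Around with touch"
```

In this sketch the relation follows from simple interval comparisons per axis; the thesis additionally distinguishes relations such as ``Surround'' and ``Top'', which would need further tests along the same lines.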
The rows of the ESEC matrix capture the spatio-temporal relations, namely touching/not-touching (rows 1-10), static (rows 11-20) and dynamic (rows 21-30) relations, within each pair of manipulated objects, while the columns contain the events that occur as a result of one or more changes in these spatio-temporal relations. Since noise, as well as insufficient accuracy in object modeling, may lead to errors in the calculation of spatio-temporal relations, our framework additionally incorporates an algorithm for noise identification and correction. Second, we designed clustering and classification algorithms on top of the ESEC framework to distinguish and recognize manipulation actions. To this end, we introduced a novel method to calculate the similarity between manipulation actions. Our algorithm was validated on a data set of 120 scenarios covering 8 action types, obtaining an accuracy of 95%. Third, the ESEC framework is employed to predict a large set of manipulations on theoretical as well as real data. Our method correctly predicted manipulation actions after, on average, only 45% of their execution had been completed, which is twice as fast as a standard Hidden Markov Model based method. This was tested on 35 theoretically defined manipulations as well as on two publicly available data sets consisting of a total of 162 scenarios in 12 action types. Finally, we designed a cognitive experiment to examine the prediction of manipulation actions in a virtual-reality-based environment. To this end, we selected 10 actions distributed over all groups and subgroups of manipulations, and then designed and created 300 scenarios of these actions, producing a large data set of manipulation actions in a virtual reality environment. To our knowledge, this is the first virtual reality data set of human manipulation actions, aimed at helping AI scientists who study human action recognition.
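The similarity measure itself is not spelled out in this summary. As a toy stand-in, one could compare two ESEC-like matrices column by column and count matching relational entries; the function below is our own illustrative sketch under that assumption, not the thesis's actual measure:

```python
def esec_similarity(m1, m2):
    """Toy similarity between two ESEC-like matrices, each a list of
    rows, where a row is a list of symbolic relations per event column.
    Returns the fraction of matching entries, normalized by the longer
    event chain, so chains of different lengths are penalized."""
    rows = len(m1)
    assert rows == len(m2)  # same relation rows (e.g. 30 in an ESEC)
    n = max(len(m1[0]), len(m2[0]))  # length of the longer event chain
    matches = 0
    for r1, r2 in zip(m1, m2):
        for c in range(n):
            # Pad the shorter chain with None so extra events mismatch.
            v1 = r1[c] if c < len(r1) else None
            v2 = r2[c] if c < len(r2) else None
            if v1 is not None and v1 == v2:
                matches += 1
    return matches / (rows * n)
```

A classifier could then assign an observed action to the stored ESEC model with the highest similarity; the thesis's actual measure and classification scheme may differ from this sketch.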
In the next step, we performed an experiment in which 50 human subjects participated; they were asked to predict the type of action in each scenario before it ended. Our ESEC-based prediction method was applied to the same scenarios and proved capable of predicting the manipulation actions up to 17.6% faster than the human participants. The main advantage of our proposed framework, ESEC, is that it encodes a manipulation in a highly invariant and abstract way, independent of object poses, perspectives and trajectories, which can vary widely. In fact, ESECs help resolve the problem of action representation under conditions where clutter and large scenes induce complexities in the analysis of scaled matrices. Different from model-based policy designs, our model-free framework operates on spatio-temporal object relations without making assumptions about the structure of objects and scenes. This new form of representation enables the novel recognition and prediction algorithms for manipulation actions, leading to high efficiency.
Keywords: Human activity recognition and prediction; Manipulation actions; Semantic scene understanding; Spatial reasoning; Robotics