The autonomy of robots is a prominent research focus in the field of robotics. A key aspect is endowing robots with the ability to accurately perceive the environment and to plan for specific tasks, which poses a considerable challenge. Conventional AI planning requires a precise definition of symbolic structuring elements, including planning domains, operators, pre- and post-conditions, as well as search/planning algorithms. In contrast, humans in daily life demonstrate rapid and effective planning across diverse scenarios and tasks without employing symbolic operations. This ability arises from our capacity for mental imagery based on visual information and past experiences. Humans construct a sequence of mental images to formulate a plan and then, during execution, engage in frequent success-checking and re-planning by updating these images. In recent years, fueled by rapid advances in deep learning, particularly in computer vision, the prospect of robots leveraging visual information for planning, akin to humans, has emerged. The focus of this thesis is the artificial synthesis of simulated mental imagery and its application to robotic planning.
First, we regard the generation of simulated mental imagery as a task of predicting future scenes. The objective is to predict future scenes based on the initial scene and the positions of objects before and after movement. To address this, we propose a method that leverages edge information from images and employs a deep generative network to synthesize the predicted future scenes. Experimental results on simulated datasets demonstrate that our method can produce high-quality future scene images. Moreover, we find that through this straightforward training process the network learns implicit 3D information to some extent. However, directly using the generative model to predict future scenes does not always yield clear structures, especially when objects overlap after movement. This challenge arises because the model learns scene representations implicitly, and its performance heavily depends on the scale of the training dataset.
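To make the setup concrete, the following is a minimal sketch of such an edge-conditioned future-scene predictor in PyTorch. The channel layout, the encoding of object positions as binary masks, and the network depth are illustrative assumptions and do not reproduce the architecture used in the thesis.

```python
# Minimal sketch of an edge-conditioned future-scene predictor (PyTorch).
# Channel counts, position encoding, and depth are illustrative assumptions.
import torch
import torch.nn as nn

class FutureScenePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: RGB scene (3) + edge map (1) + position masks before/after (2) = 6 channels
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, scene, edges, pos_before, pos_after):
        x = torch.cat([scene, edges, pos_before, pos_after], dim=1)
        return self.decoder(self.encoder(x))

# Example: predict the future scene for one 128x128 sample.
model = FutureScenePredictor()
scene = torch.rand(1, 3, 128, 128)        # initial RGB scene
edges = torch.rand(1, 1, 128, 128)        # edge map of the initial scene
pos_before = torch.zeros(1, 1, 128, 128)  # binary mask: object position before the move
pos_after = torch.zeros(1, 1, 128, 128)   # binary mask: object position after the move
future = model(scene, edges, pos_before, pos_after)  # (1, 3, 128, 128) predicted scene
```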
Due to the challenge of directly predicting future scenes discussed above, we divide the prediction into several simpler subtasks (object detection, affordance and semantic segmentation, and object completion) and train a separate network for each subtask, combining regular convolutional neural networks and generative adversarial networks. On this basis, we propose a method called Simulated Mental Imagery for Planning (SiMIP) that utilizes simulated mental imagery for robotic planning tasks. It consists of perception, simulated action, success checking, and re-planning performed on 'imagined' images. Our method relies entirely on images and thus avoids the symbolization constraints of traditional AI planning. Furthermore, the generated plan consists of a sequence of simulated images that remain interpretable by humans. We create a dataset from real scenes for a packing problem in which different objects must be correctly placed into different target slots. The experimental results indicate that plans generated by our method achieve a success rate of 90.92% (compared to a random baseline of 24.00%). Thus, we show that it is possible to implement mental imagery-based planning in an algorithmically sound way by combining regular convolutional neural networks and generative adversarial networks. The main limitation of our method is its reliance on a top-view camera.
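A hedged sketch of the resulting planning loop is given below. The learned components (detection, segmentation, object completion) are replaced by toy numpy stand-ins, and all names are placeholders; the sketch only illustrates the perceive, simulate, and success-check cycle on 'imagined' images, not the actual SiMIP implementation.

```python
# Toy sketch of a SiMIP-style loop: perceive the scene, imagine the effect of an
# action on the image, and check for success.  The learned networks are replaced
# by trivial numpy stand-ins with hypothetical names.
import numpy as np

def perceive(state):
    """Stand-in for the perception networks: remaining objects and free slots."""
    return list(state["objects"].items()), [s for s in state["slots"] if s not in state["filled"]]

def simulate_action(image, obj_patch, slot_xy):
    """Stand-in for mental imagery: paste the completed object patch at the slot position."""
    imagined = image.copy()
    y, x = slot_xy
    h, w = obj_patch.shape
    imagined[y:y + h, x:x + w] = obj_patch
    return imagined

def success(state):
    return len(state["filled"]) == len(state["slots"])

def plan(state, image):
    """Greedy search: imagine placing each object into a free slot until the task is solved."""
    steps = []
    while not success(state):
        objects, free_slots = perceive(state)
        if not objects or not free_slots:
            return None                        # no feasible plan -> would trigger re-planning
        (name, patch), slot = objects[0], free_slots[0]
        image = simulate_action(image, patch, slot)
        steps.append((name, slot, image))      # the plan is a sequence of imagined images
        del state["objects"][name]
        state["filled"].append(slot)
    return steps

# Example: two 8x8 objects, two target slots in a 32x32 top-view image.
state = {"objects": {"cube": np.ones((8, 8)), "ball": np.full((8, 8), 2.0)},
         "slots": [(0, 20), (20, 20)], "filled": []}
steps = plan(state, np.zeros((32, 32)))
print([(name, slot) for name, slot, _ in steps])
```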
Therefore, we next investigate how to translate frontal-view images into bird's eye view (BEV) images. A BEV image can be regarded as the view of a top-view camera under orthographic projection. Existing datasets for BEV images primarily focus on road scenes for autonomous driving. We therefore create an indoor table-scene dataset that comprises scenes of a table with randomly placed objects on top of it. We propose a method that employs inverse perspective mapping (IPM) as an intermediate step to help the network learn the transformation from the frontal view to the BEV. Experimental results on our dataset indicate that our approach produces BEV images that preserve the physical shapes of objects, yielding natural and high-quality representations. These synthetic BEV images can then serve as a foundation for the mental imagery formation and planning discussed above, but also for numerous other downstream perception tasks, such as object detection or semantic segmentation.
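As an illustration of the intermediate IPM step, the following sketch maps the planar table surface from a frontal image to a top-down view with OpenCV. The corner coordinates and the placeholder input image are made-up values; in the thesis setup they would follow from the camera calibration and the known table geometry.

```python
# Illustration of the inverse perspective mapping (IPM) step with OpenCV.
import cv2
import numpy as np

# Placeholder for the frontal camera image (in practice: the captured RGB frame).
frontal = np.zeros((480, 640, 3), dtype=np.uint8)

# Corners of the planar table surface in the frontal image (pixel coordinates, assumed).
src = np.float32([[180, 300], [460, 300], [620, 470], [20, 470]])
# Where these corners should land in the bird's eye view (orthographic top view).
dst = np.float32([[0, 0], [512, 0], [512, 512], [0, 512]])

H = cv2.getPerspectiveTransform(src, dst)          # 3x3 homography for the table plane
ipm = cv2.warpPerspective(frontal, H, (512, 512))  # IPM image: geometrically correct for the
                                                   # plane, but objects above it appear stretched;
                                                   # the generative network refines this into a BEV.
```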
Additionally, we streamline the previously proposed SiMIP system and deploy it in a real-world environment. Here, we divide the SiMIP process into two modules: scene decomposition and affordance-based search. Scene decomposition identifies and completes the entities within the scene and analyzes the affordances of each entity. Affordance-based search then explores feasible plans based on the scene information obtained in the previous step. For online planning, we introduce an execution and feedback processing flow: whenever a discrepancy arises between the mental plan and the scene observed after a real action, re-planning is triggered until the task is complete. In conclusion, compared to conventional symbolic methods, our approach is more flexible and better suited to task planning in dynamic scenarios.
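The execution-and-feedback flow can be summarized by the following control-loop sketch; all callables are placeholders standing in for the perception, planning, and robot-execution components, not the actual interfaces of the system.

```python
# Sketch of the online execution-and-feedback loop: execute one planned step, observe
# the real scene, and re-plan whenever it deviates from the corresponding imagined image.
# plan_fn, execute_fn, observe_fn, and matches_fn are placeholder callables.
def execute_with_feedback(initial_scene, plan_fn, execute_fn, observe_fn, matches_fn):
    scene = initial_scene
    plan = plan_fn(scene)                  # affordance-based search over imagined images
    while plan:                            # plan: list of (action, imagined_image) pairs
        action, imagined = plan[0]
        execute_fn(action)                 # real robot action
        scene = observe_fn()               # new camera image after execution
        if matches_fn(scene, imagined):    # success check against the mental image
            plan = plan[1:]                # proceed with the remaining steps
        else:
            plan = plan_fn(scene)          # discrepancy -> re-plan from the observed scene
    return scene
```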