Perceptual Segmentation of Visual Streams by Tracking of Objects and Parts
by Jeremie Papon
Date of Examination: 2014-10-17
Date of issue: 2014-11-27
Advisor: Prof. Dr. Florentin Wörgötter
Referee: Prof. Dr. Florentin Wörgötter
Referee: Prof. Dr. Justus Piater
Files in this item
Name: thesis_out.pdf
Size: 31.7 MB
Format: PDF
Abstract
The ability to parse visual streams into semantically meaningful entities is an essential element of intelligent systems. This process, known as segmentation, is a necessary precursor to high-level, vision-based behavior such as object identification, scene understanding, and task planning. Tracking these segmented entities over time further enriches this knowledge by extending it to the action domain. This work proposes to establish a closed loop between video object segmentation and multi-target tracking in order to parse streaming visual data. We demonstrate the strengths of this approach and show how such a framework can be used to distill basic semantic understanding of complex actions in real time, without the need for a priori object knowledge. Importantly, the framework is highly robust to occlusions, fast movements, and deforming objects. This thesis makes four key contributions, each of which leads toward fast and robust video segmentation through tracking.

First, we present Video Segmentation by Relaxation of Tracked Masks (VSRTM), a proof of concept demonstrating the feasibility of Dynamic Segment Tracking in 2D video and, in particular, the viability of a feedback loop between video object segmentation and multi-target tracking. This is accomplished with a sequential Bayesian technique that generates predictions used to seed a segmentation kernel, whose results in turn update the tracked models.

The second contribution is a 3D voxel clustering technique, Voxel Cloud Connectivity Segmentation, which uses a novel adjacency octree structure to efficiently cluster 3D point cloud data and to provide a graph lattice for the otherwise unstructured points. These clusters of voxels, or supervoxels, and their adjacency graph are used to maintain a world model which serves as an internal observation buffer for the trackers. Importantly, this world model uses ray-tracing to ensure that occluded voxels are not deleted as new frames of data arrive.

The third contribution is a novel spatially stratified sampling technique for evaluating the likelihood function in particle filters. In particular, we show that when the measurement function relies on spatial correspondence, computational cost can be greatly reduced by exploiting spatial structure to avoid redundant computations. We present quantitative results showing that the technique achieves accuracy equivalent to, and in some cases greater than, a reference point cloud particle filter at significantly faster run-times. We also compare against a GPU implementation and show that we can exceed its performance on the CPU. In addition, we present results on a multi-target tracking application, demonstrating that the efficiency gains permit online 6DoF multi-target tracking on standard hardware.

Our final contribution is Predictive Association of Supervoxels, which closes the loop between segmentation and tracking by minimizing a global energy function that scores supervoxel associations. The energy function is efficiently computed using the adjacency octree, with candidate associations provided by the 3D correspondence-based particle filters. The resulting association determines a fully segmented point cloud and is used to update the tracker models (as in VSRTM). This yields temporally consistent supervoxels without the need to pre-define object models for segmentation.
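To make the voxel clustering idea behind the second contribution concrete, the following minimal Python sketch shows the general pattern that Voxel Cloud Connectivity Segmentation builds on: voxelize the points, build an adjacency graph over the occupied voxels, and grow clusters over that graph. It is an illustration only, not the thesis implementation; the 0.01 m resolution, the hash map standing in for the adjacency octree, and plain connected-component growth standing in for the seeded, feature-weighted expansion used by the actual method are all simplifying assumptions.

# Illustrative sketch only: voxelize a point cloud, build a 26-neighbour
# adjacency graph over the occupied voxels, and grow clusters over that graph.
from collections import defaultdict, deque
import itertools

def voxelize(points, resolution=0.01):
    """Map each (x, y, z) point to an integer voxel key; keep the voxel centroid."""
    bins = defaultdict(list)
    for p in points:
        key = tuple(int(c // resolution) for c in p)
        bins[key].append(p)
    return {k: tuple(sum(c) / len(c) for c in zip(*v)) for k, v in bins.items()}

def adjacency(voxels):
    """Connect each occupied voxel to its occupied 26-neighbours."""
    offsets = [o for o in itertools.product((-1, 0, 1), repeat=3) if any(o)]
    graph = {}
    for k in voxels:
        graph[k] = [n for n in ((k[0] + dx, k[1] + dy, k[2] + dz) for dx, dy, dz in offsets)
                    if n in voxels]
    return graph

def grow_clusters(graph):
    """Breadth-first flood fill: label each connected component of the voxel graph."""
    labels, next_label = {}, 0
    for start in graph:
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            for n in graph[queue.popleft()]:
                if n not in labels:
                    labels[n] = next_label
                    queue.append(n)
        next_label += 1
    return labels

Given a point cloud as a list of (x, y, z) tuples, grow_clusters(adjacency(voxelize(points))) assigns a label to every occupied voxel; the thesis method additionally constrains growth by seed spacing and by colour and normal similarity, which this sketch omits.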
Each of these contributions has been implemented in live systems and runs in an online, streaming manner. We have performed quantitative evaluation on existing benchmarks to demonstrate state-of-the-art tracking and segmentation performance. In the 2D case, we compare against an existing tracking benchmark and show that we match its tracking performance, while in the 3D case we use a benchmark to show that we can outperform a GPU implementation. Finally, we give qualitative results in a robotic teaching application and show that the system is able to parse real data and to distill semantic understanding from video.
Keywords: Video Segmentation; Point Clouds; Segmentation; Visual Tracking; Computer Vision